Human Tracking Using Convolutional Neural Networks • IEEE Transactions on Neural Networks, vol. 21, no. 10, October 2010 • Jialue Fan, Wei Xu, Ying Wu, Yihong Gong

1

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

2

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

3

INTRODUCTION • The major challenge of traditional learning-based and/or tracking-by-detection methods is false-positive matches, which lead to wrong association of the tracks (drift)

4

INTRODUCTION • We extract both spatial and temporal structures (motion information) by considering the image pair of two consecutive frames rather than a single frame • In this paper, we use convolutional neural networks (CNNs) • Shift-invariant? → Shift-variant • Local features & Global features • Scale change? → Key Points

5

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

6

WHAT IS CNN? • Fully Connected Neural Networks

Spatial structure? Does a fully connected network exploit the spatial structure of the image (head, body, legs)?

7

Convolutional Neural Networks (CNN)

8

Convolutional Neural Networks (CNN) • 28 × 28 input image

9

Local receptive fields

10

Local receptive fields • Each hidden neuron is connected to a 5 × 5 local receptive field of the 28 × 28 input, giving a 24 × 24 first hidden layer • Sometimes a different stride length is used

11

Shared weights and bias • The output of the hidden neuron at position (j, k) is σ(b + Σ_{l=0..4} Σ_{m=0..4} w_{l,m} a_{j+l, k+m}) • σ is the sigmoid function, b the shared bias, w_{l,m} the 5 × 5 shared weights, and a_{x,y} the input activation at (x, y)
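A minimal numpy sketch of this computation, assuming a single 5 × 5 shared filter slid over a 28 × 28 input with each output passed through the sigmoid (variable names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    """Slide one shared 5x5 weight matrix w (plus a shared bias b)
    over the 28x28 input activations a; returns a 24x24 map."""
    k = w.shape[0]                      # receptive-field size (5)
    H, W = a.shape
    out = np.empty((H - k + 1, W - k + 1))
    for j in range(out.shape[0]):
        for m in range(out.shape[1]):
            out[j, m] = sigmoid(b + np.sum(w * a[j:j + k, m:m + k]))
    return out

a = np.random.rand(28, 28)              # toy input image
w = np.random.randn(5, 5) * 0.1         # shared weights
print(feature_map(a, w, b=0.0).shape)   # (24, 24)
```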

12

Shared weights and bias • To do image recognition we'll need more than one feature map

LeNet-5 used 6 feature maps

13

Shared weights and bias • A big advantage of sharing weights and biases is that it greatly reduces the number of parameters • Fully connected: 28 × 28 (input neurons) × 30 (hidden neurons) + 30 (biases) = 23,550 • Convolutional: 5 × 5 (shared weights) × 20 (feature maps) + 20 (biases) = 520 (with 30 feature maps: 780)
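The same counts as a tiny arithmetic check (numbers taken directly from the slide):

```python
# Parameter-count comparison, reproducing the slide's arithmetic.
fully_connected = 28 * 28 * 30 + 30       # 23550
per_feature_map = 5 * 5 + 1               # 25 shared weights + 1 bias
conv_20_maps = 20 * per_feature_map       # 520
conv_30_maps = 30 * per_feature_map       # 780
print(fully_connected, conv_20_maps, conv_30_maps)
```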

14

Pooling layers • Simplify the information in the output from the convolutional layer • Max-pooling (2 × 2): 24 × 24 → 12 × 12
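A minimal sketch of 2 × 2 max-pooling on one feature map (purely illustrative):

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max-pooling: keep the largest activation in each
    non-overlapping 2x2 block (24x24 -> 12x12)."""
    H, W = fmap.shape
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.random.rand(24, 24)
print(max_pool_2x2(fmap).shape)   # (12, 12)
```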

15

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

16

CNN TRACKING

17

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

18

Detail Description 1. Each normalized patch (48 x 128) of the image pair is split into five input feature maps (R/G/B/𝐷𝑥 /𝐷𝑦 ) 2. 𝐶1 is a convolutional layer with 10 feature maps connected to a 5 × 5 neighborhood of the input → 44 x 124 ※ 𝐶1 (k, i, j) is the value at position (i, j) in the kth feature map of layer 𝐶1 3.
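A minimal numpy sketch of assembling the five-channel input and the shape bookkeeping of C1. The exact definitions of Dx/Dy and the learned filters come from the paper; here Dx/Dy are stood in by simple finite differences of the frame pair, purely to keep the example self-contained:

```python
import numpy as np

def build_input(prev_patch, cur_patch):
    """Stack five 48x128 input feature maps: R, G, B of the current
    patch plus two difference maps Dx, Dy (illustrative stand-ins)."""
    r, g, b = cur_patch[..., 0], cur_patch[..., 1], cur_patch[..., 2]
    diff = cur_patch.mean(axis=-1) - prev_patch.mean(axis=-1)
    dx = np.gradient(diff, axis=1)      # horizontal change
    dy = np.gradient(diff, axis=0)      # vertical change
    return np.stack([r, g, b, dx, dy])  # shape (5, 48, 128)

def conv_layer(x, kernels, biases):
    """Valid 5x5 convolution: (5, 48, 128) -> (10, 44, 124)."""
    n_out, n_in, k, _ = kernels.shape
    assert x.shape[0] == n_in
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((n_out, H, W))
    for o in range(n_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(kernels[o] * x[:, i:i + k, j:j + k]) + biases[o]
    return out

prev = np.random.rand(48, 128, 3)
cur = np.random.rand(48, 128, 3)
x = build_input(prev, cur)
c1 = conv_layer(x, np.random.randn(10, 5, 5, 5) * 0.01, np.zeros(10))
print(x.shape, c1.shape)   # (5, 48, 128) (10, 44, 124)
```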

19

Detail Description 4. The global branch aims to enlarge the receptive field; the local branch aims to discover more details about local structures

20

Detail Description

η_{λ,i}, i = 0 … p−1 (example: λ = 9, p = 4, η_{9,3} = 4)

21

Detail Description

[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998

• Kλ: p × 3 × 7 trainable • randomly choose 10 feature maps • four-times upsampling? • Global receptive field = 28 × 68

22

Detail Description • Each unit in each feature map is connected to a 7 × 7 neighborhood of layer S1

23

Detail Description • The convolution filter that produces the probability map is a linear function followed by a sigmoid transformation (trainable)
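A minimal sketch of this step, assuming for illustration a 1 × 1 trainable combination across feature maps whose response is squashed to [0, 1] by a sigmoid (shapes and names are illustrative):

```python
import numpy as np

def probability_map(features, w, b):
    """Linear combination of the feature maps at each position,
    followed by a sigmoid, giving a per-pixel probability."""
    linear = np.tensordot(w, features, axes=([0], [0])) + b  # (H, W)
    return 1.0 / (1.0 + np.exp(-linear))

features = np.random.rand(10, 22, 62)        # illustrative feature-map stack
w = np.random.randn(10) * 0.1                # trainable weights
print(probability_map(features, w, b=0.0).shape)
```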

24

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

25

From Shift-Invariant to Shift-Variant

• A shift-invariant detector assigns a high detection score to every candidate that resembles the target
• The tracker's response should also depend on the object’s previous location: the local responses 𝐿𝐴 and 𝐿𝐴′ are similar, but the global response 𝐺𝐴 is larger than 𝐺𝐴′
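The paper realizes shift-variance through its global branch; purely as an illustration of the idea, one can weight a shift-invariant score map by proximity to the previous target location (this sketch is not the paper's formulation):

```python
import numpy as np

def shift_variant_score(score_map, prev_loc, sigma=20.0):
    """Illustrative only: modulate a shift-invariant detection score
    map by a Gaussian prior centred on the previous target location,
    so identical appearances far from the track score lower."""
    H, W = score_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    prior = np.exp(-((ys - prev_loc[0]) ** 2 + (xs - prev_loc[1]) ** 2)
                   / (2.0 * sigma ** 2))
    return score_map * prior

scores = np.random.rand(128, 48)
best = np.argmax(shift_variant_score(scores, prev_loc=(60, 20)))
print(np.unravel_index(best, scores.shape))
```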

26

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

27

Training Procedure • The offline training set includes around 20 000 samples • The dataset was collected by NEC Laboratories • We manually annotate the bounding boxes of human heads • Given one bounding box centered at 𝑥𝑡−1 , we extract 𝑁𝑡−1 (𝑥𝑡−1 ) and 𝑁𝑡 (𝑥𝑡−1 ) as the CNN input

28

Training Procedure

29

Training Procedure • The model is trained offline using standard stochastic gradient descent, minimizing the difference between the probability map output by the CNN and the target probability map • Once the model is trained, it is fixed during tracking (unlike adaptive models)
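A minimal sketch of one such offline SGD step under a squared-error loss, with a toy per-pixel linear model standing in for the full network (the actual architecture and loss are as in the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, x, target, lr=0.01):
    """One stochastic-gradient step: push the predicted probability
    map towards the target map under a squared-error loss.
    Toy model: per-pixel linear score over the input channels."""
    pred = sigmoid(np.tensordot(w, x, axes=([0], [0])) + b)   # (H, W)
    err = pred - target                                       # dL/dpred
    grad_pre = err * pred * (1.0 - pred)                      # through the sigmoid
    grad_w = np.tensordot(grad_pre, x, axes=([0, 1], [1, 2]))
    w -= lr * grad_w
    b -= lr * grad_pre.sum()
    return w, b

x = np.random.rand(5, 48, 128)                 # one training input
target = np.zeros((48, 128))
target[20:30, 60:70] = 1.0                     # toy target probability map
w, b = np.zeros(5), 0.0
for _ in range(100):
    w, b = sgd_step(w, b, x, target)
```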

30

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

31

[30] R. T. Collins, “Mean-shift blob tracking through scale space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Madison, WI, Jun. 2003, pp. 234–240

Handling the Scale Change • In [30], a solution to scale estimation integrated into the mean-shift framework is presented • In our method, we calculate the scale by detecting the key points of the target

Not trained independently!

32

Handling the Scale Change • 𝑠𝑡 : the scale of 𝑥𝑡
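The paper derives the scale from detected key points; as a rough, hypothetical illustration, a scale factor can be taken as the ratio of mean pairwise key-point distances between frames:

```python
import numpy as np

def scale_from_keypoints(kp_prev, kp_cur):
    """Illustrative only: estimate the scale change as the ratio of
    mean pairwise key-point distances in the current vs. previous frame."""
    def mean_pairwise(kp):
        d = np.linalg.norm(kp[:, None, :] - kp[None, :, :], axis=-1)
        return d[np.triu_indices(len(kp), k=1)].mean()
    return mean_pairwise(kp_cur) / mean_pairwise(kp_prev)

kp_prev = np.array([[10.0, 20.0], [40.0, 25.0], [30.0, 90.0]])
kp_cur = kp_prev * 1.2            # target grew by 20%
print(round(scale_from_keypoints(kp_prev, kp_cur), 2))   # ~1.2
```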

33

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

34

EXPERIMENTAL RESULTS Setup and Comparison Baseline • None of the test sequences shown here were used in training • C++, Intel 3.6-GHz desktop, 10–15 frames/s on average (640 × 480) • Compared with: 1. [5] Ensemble tracker (learning-based method, does not handle scale) 2. [23] Support vector tracker 3. [31] Mean-shift tracker 4. [32] Adaptive model (clustering method) • We still use 48 × 128 image patches as our CNN input even when only some parts of the whole human body are visible

35

EXPERIMENTAL RESULTS • [5] S. Avidan, “Ensemble tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 261–271, Feb. 2007 • [23] S. Avidan, “Support vector tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1064–1072, Aug. 2004 • [31] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Hilton Head, SC, Jun. 2000, pp. 142–149 • [32] M. Yang, Z. Fan, J. Fan, and Y. Wu, “Tracking non-stationary visual appearances by data-driven adaptation,” IEEE Trans. Image Process., vol. 18, no. 7, pp. 1633–1644, Jul. 2009

36

EXPERIMENTAL RESULTS Impact of Using Temporal Features

Both spatial and temporal

Only spatial

37

EXPERIMENTAL RESULTS Impact of Using Temporal Features

Both spatial and temporal

Only spatial

38

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

CNN tracker

39

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

CNN tracker

40

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

CNN tracker only global branch

41

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

CNN tracker only global branch

42

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

CNN tracker only local branch

43

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

CNN tracker only local branch

44

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

Shift-invariant

45

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

Shift-invariant

46

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

Ensemble tracker

47

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

Ensemble tracker

48

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

Support vector tracker

49

EXPERIMENTAL RESULTS Impact of Shift-Variant Architecture

Support vector tracker

50

EXPERIMENTAL RESULTS Drift Correction • For the purpose of drift correction, we add small random shifts to the samples in the training procedure
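A minimal sketch of this kind of augmentation, assuming the random shift is applied to the crop location of each training patch (function name, patch-size ordering, and shift range are illustrative):

```python
import numpy as np

def jittered_crop(frame, center, size=(128, 48), max_shift=4, rng=np.random):
    """Crop a patch around `center`, perturbed by a small random shift,
    so the tracker learns to re-centre a slightly drifted box."""
    dy, dx = rng.randint(-max_shift, max_shift + 1, size=2)
    cy, cx = center[0] + dy, center[1] + dx
    h, w = size
    y0 = int(np.clip(cy - h // 2, 0, frame.shape[0] - h))
    x0 = int(np.clip(cx - w // 2, 0, frame.shape[1] - w))
    return frame[y0:y0 + h, x0:x0 + w], (dy, dx)

frame = np.random.rand(480, 640, 3)
patch, shift = jittered_crop(frame, center=(240, 320))
print(patch.shape, shift)   # (128, 48, 3) and the applied (dy, dx)
```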

51

EXPERIMENTAL RESULTS

52

EXPERIMENTAL RESULTS Quantitative Experiments

Shopping

Occlusion

53

EXPERIMENTAL RESULTS Quantitative Experiments

54

EXPERIMENTAL RESULTS Tracking with Partial Occlusion

CNN tracker

55

EXPERIMENTAL RESULTS Tracking with Partial Occlusion

CNN tracker

56

EXPERIMENTAL RESULTS Tracking with Partial Occlusion

Ensemble tracker

57

EXPERIMENTAL RESULTS Tracking with Partial Occlusion

Ensemble tracker

58

EXPERIMENTAL RESULTS Tracking with Partial Occlusion

Support vector tracker

59

EXPERIMENTAL RESULTS Tracking with Partial Occlusion

Support vector tracker

60

EXPERIMENTAL RESULTS Illumination Changes

CNN tracker

61

EXPERIMENTAL RESULTS Illumination Changes

CNN tracker

62

EXPERIMENTAL RESULTS Illumination Changes

Mean-shift tracker

63

EXPERIMENTAL RESULTS Illumination Changes

Mean-shift tracker

64

EXPERIMENTAL RESULTS Illumination Changes

Clustering method

65

EXPERIMENTAL RESULTS Illumination Changes

Clustering method

66

EXPERIMENTAL RESULTS Scale and View Changes

CNN tracker

67

EXPERIMENTAL RESULTS Scale and View Changes

CNN tracker

68

EXPERIMENTAL RESULTS Scale and View Changes

Mean-shift tracker

69

EXPERIMENTAL RESULTS Scale and View Changes

Mean-shift tracker

70

EXPERIMENTAL RESULTS Scale and View Changes

Clustering method

71

EXPERIMENTAL RESULTS Scale and View Changes

Clustering method

72

EXPERIMENTAL RESULTS

73

EXPERIMENTAL RESULTS

74

EXPERIMENTAL RESULTS

75

EXPERIMENTAL RESULTS Discussion • In our experiments, we treat humans as the subject of tracking • The target is related to the spatial context induced by the pixels in its vicinity (shoulder, torso) • Trained using only 1000 samples → drift due to cluttered background • How to select representative training samples is nontrivial and may be future work

76

Outline • INTRODUCTION • WHAT IS CNN? • CNN TRACKING • Detail Description • From Shift-Invariant to Shift-Variant • Training Procedure • Handling the Scale Change • EXPERIMENTAL RESULTS • CONCLUSION

77

CONCLUSION • A learning method for tracking based on CNNs • Spatial and temporal structures • Shift-variant architecture • Global and local features • Key points solve the scale problem • The main limitation is that the CNN model is not designed to handle full, long-term occlusions by distracters of the same object class

78

Thanks for listening!

79