arXiv:1512.05830v1 [cs.CV] 18 Dec 2015
Learning Deep Convolutional Neural Networks for Places2 Scene Recognition

Li Shen^1, Zhouchen Lin^2, Qingming Huang^1

1 University of Chinese Academy of Sciences, Beijing, China. Email: [email protected], [email protected]
2 Key Lab. of Machine Perception (MOE), School of EECS, Peking University. Email: [email protected] (Corresponding Author)

December 21, 2015

Abstract

We describe the optimization algorithm and methodologies we used in the ILSVRC 2015 Scene Classification Challenge, in which we won first place.
1 Motivation
Deep learning has been very successful in recent years [1]. However, training deep neural networks effectively and efficiently is a long-standing problem. A widely adopted training algorithm is the celebrated error back-propagation (BP) algorithm [1]. However, naive implementations of BP suffer from several important issues, such as overfitting and gradient vanishing/exploding [1]. In recent years, much effort has been spent on overcoming these issues [1, 3, 4]. For example, refined initialization [3] and batch normalization [4] can greatly reduce the risk of gradient vanishing/exploding at lower layers. Figure 1 shows that this difficulty can be overcome well. However, we observe that deeper neural networks may not lead to better performance, even though deeper networks have higher representation power. This is exemplified by Table 1, where we can see that simply increasing the depth of a neural network may hurt performance.

Table 1: Top-5 testing errors (%) of VGG-like architectures with different depths.

Depth       19      22      25
Top-5 err.  18.93   19.00   19.21
Figure 1: Gradient vanishing/exploding has been overcome well by existing techniques, such as [3, 4]. (a) The mean magnitudes of gradients at different layers. (b) The mean magnitudes of gradients divided by the mean magnitudes of weights at different layers.

Why does this phenomenon happen? One may guess that it is due to overfitting. However, we did not observe overfitting in our experiments. Since gradient vanishing/exploding is not an issue either, there must be other reasons. We consider this problem from an information processing viewpoint: we view the error BP process as an information propagation process. Then, by the Data Processing Theorem [2], which basically states that data processing destroys information, we hypothesize that there may be information loss during the BP process. So our basic idea is to limit the number of layers across which error BP is performed. Thus we propose the error Relay Back-Propagation (Relay BP) algorithm¹ for deep neural network training. For the Places2 Scene Classification Challenge, we further designed two CNN architectures, which we call model A and model B, respectively. Both are trained by our Relay BP algorithm. We also propose a class-aware sampling method to address the large scale of the data and the unbalanced distribution of the numbers of samples across classes. In the following, we describe these three components in detail.
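The Data Processing Theorem invoked above (also known as the data processing inequality [2]) can be stated precisely. If the gradient signal at each layer is viewed as a processed version of the error signal at the layer above, the chain of BP steps forms a Markov chain, and the mutual information with the original loss signal can only shrink at each step:

```latex
% Data processing inequality (Cover & Thomas [2]):
% if X \to Y \to Z form a Markov chain (Z depends on X only through Y), then
I(X;Z) \;\le\; I(X;Y).
% Viewing error BP as X \to Y \to Z \to \cdots, the information about the
% loss carried by gradients cannot increase as it propagates to lower layers.
```

This motivates bounding the number of BP steps per segment, as Relay BP does.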
2 Relay Back-Propagation
As we just analyzed, we should not back-propagate the error information across too many layers. So we first break all the layers into several overlapping segments. Each segment has at most N consecutive layers, where N is the upper limit on the number of layers for which we deem that error BP can still carry enough information when reaching the lowest layer of the segment. Denote the number of overlapping layers between nearby segments as n. Both N and n are tunable empirical parameters. Second, we concatenate output layers to the highest layer of each segment, except the first segment, which already includes the output layers of the whole neural network. With the concatenated output layers, we can compute the loss of each segment. At the highest layer of each segment, the feed-forward information is passed both to the next layer in the original network and to the output layers concatenated to it. Third, for each segment, we feed forward the processed training data and compute the gradients of the weights between successive layers by back-propagating the error of that segment. Since the depth of each segment is not very large, we deem that error BP is effective within each segment. Fourth, we fuse the gradients of weights between overlapping layers of nearby segments, e.g., by simple averaging. For gradients of weights that are not between overlapping layers, we simply keep their values. We thereby obtain the gradients of all the weights in the original network. Finally, we update the weights using the computed gradients and a learning rate, which decreases with the iterations of Relay BP.

¹ Patent pending.

Figure 2: Illustration of Relay BP. We illustrate with two relays. The green and red layers make segment 1, the red and pink layers make segment 2, and the pink and purple layers make segment 3. Nearby segments overlap in some layers. To limit the number of layers that error BP traverses, Relay BP concatenates output layers to the highest layers in segments 2 and 3, making two branches.
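The fourth and fifth steps above (gradient fusion and the weight update) can be sketched as follows. This is our own minimal reconstruction, not the authors' released implementation; the helper names `fuse_segment_gradients` and `sgd_step` are illustrative, and "fusion" here is the simple averaging mentioned in the text:

```python
import numpy as np

def fuse_segment_gradients(grads_a, grads_b, overlap):
    """Fuse per-layer weight gradients from two nearby segments.

    grads_a, grads_b: dicts mapping layer name -> gradient array, each
        computed by back-propagating that segment's own loss.
    overlap: set of layer names shared by both segments; their two
        gradient estimates are averaged. All other gradients are kept
        unchanged, as in the paper's description.
    """
    fused = {}
    for name, g in grads_a.items():
        fused[name] = 0.5 * (g + grads_b[name]) if name in overlap else g
    for name, g in grads_b.items():
        if name not in fused:
            fused[name] = g
    return fused

def sgd_step(weights, grads, lr):
    """Plain gradient step with the current (decaying) learning rate."""
    return {name: w - lr * grads[name] for name, w in weights.items()}
```

With more than two segments, the same pairwise fusion would be applied to each pair of nearby segments in turn.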
3 Two Network Architectures
Specifically for the Places2 Scene Classification Challenge, we designed two CNNs, which we call model A and model B, respectively. Their parameters are shown in Figure 3. Model A has 22 weight layers in total, obtained by adding three 3×3 convolutional layers to VGG-19 and replacing the last max-pooling layer with an SPP layer. "×5" denotes stacking five convolutional layers. In model B, the convolutional block is expected to integrate multi-scale information; it is modified from the Inception module in GoogLeNet. "dbl 3×3" denotes stacking two convolutional layers with 3×3 filters.

Figure 3: Parameters of the two CNNs, model A and model B. "branch" means that an output layer is concatenated right after that layer.
4 Class-aware Sampling
The Places2 dataset has about 8,000,000 training images in total. The numbers of images in different classes are imbalanced, ranging from 4,000 to 30,000 per class. The large-scale data and non-uniform class distribution pose great challenges for model learning. To address this issue, we apply a sampling strategy during training, named class-aware sampling. We use two types of lists: a class list, and one image list per class. When filling a training batch in an iteration, we first sample a class from the class list, say class A, then sample an image from class A's image list; for the next sample we pick another class, say class B, and an image from its list, and so on. When we reach the end of a class's image list, it is reshuffled; when we reach the end of the class list, it is also reshuffled. This class-aware sampling strategy effectively tackles the imbalanced distribution; the extra operations are cheap and gain about 0.6% improvement in validation accuracy.
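The sampling loop above can be sketched as follows. This is a minimal reconstruction of the described strategy, not the released code; the class name `ClassAwareSampler` and its methods are our own illustrative choices:

```python
import random

class ClassAwareSampler:
    """Sample training images uniformly over classes, not over images.

    images_by_class: dict mapping class name -> list of image ids.
    Each draw first takes the next class from a shuffled class list,
    then the next image from that class's shuffled image list; either
    list is reshuffled whenever it is exhausted, as in the paper.
    """

    def __init__(self, images_by_class, seed=0):
        self.rng = random.Random(seed)
        self.classes = list(images_by_class)
        self.images = {c: list(v) for c, v in images_by_class.items()}
        self.rng.shuffle(self.classes)
        for lst in self.images.values():
            self.rng.shuffle(lst)
        self.class_pos = 0
        self.image_pos = {c: 0 for c in self.images}

    def sample(self):
        if self.class_pos == len(self.classes):    # end of class list
            self.rng.shuffle(self.classes)
            self.class_pos = 0
        cls = self.classes[self.class_pos]
        self.class_pos += 1
        lst = self.images[cls]
        if self.image_pos[cls] == len(lst):        # end of this class's images
            self.rng.shuffle(lst)
            self.image_pos[cls] = 0
        img = lst[self.image_pos[cls]]
        self.image_pos[cls] += 1
        return cls, img

    def batch(self, size):
        return [self.sample() for _ in range(size)]
```

Because every pass through the class list visits each class exactly once, a rare class with 4,000 images contributes as many samples per epoch as a common class with 30,000, at the cost of repeating rare-class images more often.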
5 Experimental Results
We first show the error rates on the validation set of Places2. The implementation is based on the common library Caffe. We carry out the following experiments to verify the effect of Relay BP with different testing methods and architectures.
Figure 4: Error rates (%) on the validation set of Places2.
Figure 5: Error rates (%) on the test set of Places2.

Learning with one loss and standard back-propagation is the baseline. We apply multi-loss learning with back-propagation [4] as another comparison method. We empirically find that performance benefits from adding an auxiliary loss branch when the learning rate reduces to 10^-3, so we apply this mechanism in the experiments. The results are shown in Figure 4; in brackets are the improvements over the baseline. A clear advantage of our method can be observed. The improvements by model B are smaller than those by model A. We conjecture that this is caused by the skip connection in model B's module, which is already beneficial for passing useful information to lower layers. On the other hand, the improvements of the single model over the center crop are less impressive, compared with the empirical results on the ImageNet classification dataset. Next, we show the errors on the test set of Places2. The results are shown in Figure 5. The model ensemble achieves the best result on the test set, but the gap between the model ensemble and the single models is marginal, about 0.4%. From center crop to single model, and further to model ensemble, the improvement is insignificant. We conjecture that the large-scale training data enhances the capability of single convolutional neural networks. By our results shown in Figure 5, our team, WM, won the 1st place in the Places2 Scene Classification Challenge. In fact, our five submissions occupied the top five places.
6 Conclusions
We propose Relay BP for more effective deep neural network training. Specifically for the Places2 Scene Classification Challenge, we further propose two CNN models and a class-aware sampling method. Our five submissions won the top five places, testifying to the effectiveness of the proposed key components.
References

[1] Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep Learning. In preparation, 2016.
[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
[3] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[4] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In AISTATS, 2015.