EvoNet: Evolutionary Synthesis of Deep Neural Networks
Mohammad Javad Shafiee, Student Member, IEEE, Akshaya Mishra, and Alexander Wong, Senior Member, IEEE

Mohammad Javad Shafiee and Alexander Wong are with the University of Waterloo, Waterloo, ON, Canada. Akshaya Mishra is with Miovision Technologies Inc., Waterloo, ON, Canada.
Abstract—In this study, we introduce the idea of synthesizing new highly efficient, yet powerful deep neural networks via a novel evolutionary process from ancestor deep neural networks. The architectural genetics of ancestor deep neural networks are encapsulated using the concept of synaptic probability density models, which can be viewed as the 'DNA' of these ancestor networks. These synaptic probability density models from the ancestor networks are then leveraged, along with computational environmental factor models, to synthesize new descendant deep neural networks with different network architectures in a random manner that mimics natural selection and random mutation. These 'evolved' deep neural networks (which we will term EvoNets) are then trained into fully functional networks, as one would train a newborn, and have more efficient, more varied synapse architectures than their ancestor networks while achieving powerful modeling capabilities. Experiments were performed using the MSRA-B dataset for the purpose of image segmentation, and it was demonstrated that the synthesized EvoNets can achieve a state-of-the-art Fβ score (0.872 at the second generation, 0.859 at the third generation, and 0.839 at the fourth generation) while having synapse architectures that are significantly more efficient (∼19X fewer synapses by the fourth generation) compared to the original ancestor deep neural network.

Index Terms—Deep learning, Evolution, Synthesis, Deep Neural Networks, Natural Selection, Random Mutation.
I. INTRODUCTION
Deep learning, especially deep neural networks (DNNs) [1], has shown considerable promise in recent years, significantly improving accuracy on a variety of challenging problems compared to other machine learning methods. For example, the application of DNNs to image classification and segmentation [2], [3] has demonstrated near-human recognition accuracies on the ImageNet dataset [4], [5]. It has also been shown that DNNs outperform other machine learning methods in the realm of speech recognition [6], [7], [8], enabling reliable voice-driven human-computer interfacing. Although deep learning approaches such as deep neural networks have proven their ability in a wide variety of applications, they require high-performance computing systems, such as those based on GPUs, due to the tremendous number of computational layers they possess, with thousands upon thousands of parameters to compute. Furthermore, the complex architectures of deep neural networks typically require machine learning experts to design, along with computationally expensive hyperparameter optimization strategies. This issue
of complexity in the design of deep neural networks has increased greatly in recent years [5], [9], [10], driven by the demand for increasingly deeper and larger deep neural networks to boost modeling accuracy. As such, it has become increasingly difficult to take advantage of such complex deep neural networks in scenarios where computational and energy resources are scarce.

To enable the widespread use of deep learning, there has been a recent drive towards obtaining highly-efficient deep neural networks with strong modeling power and accuracy. Deep neural networks with highly efficient architectures are critically important for enabling the proliferation of deep learning across a wide range of applications, since they allow such networks to operate on low-cost, low-power, low-memory platforms such as smartphones or embedded systems. Much of the work on obtaining efficient deep neural networks has focused on compressing existing trained deep neural networks using traditional, deterministic lossless and lossy compression techniques [11] such as quantization [12], [13], deterministic pruning [11], [14], Huffman coding [13], and hashing [15].

Rather than attempting to take an existing deep neural network and force it into a smaller representation heuristically, we instead consider a radically different idea in this study: can deep neural networks evolve naturally over generations into highly efficient deep neural networks? As an example of evolutionary progress towards efficiency in nature, a recent study by Moran et al. [16] proposed that the eyeless Mexican cavefish evolved to lose its vision system over generations due to the high metabolic cost of vision. By evolving in a way that sheds its vision system, the cavefish significantly reduces the amount of energy it expends, which improves survivability in subterranean habitats where food availability is low. The ability to replicate this evolutionary process to produce highly-efficient deep neural networks over generations could therefore have considerable benefits.

Motivated by this evolutionary progress towards efficiency in nature, in this study we entertain a radically different notion for obtaining highly-efficient deep neural networks: we introduce an evolutionary process that synthesizes deep neural networks over successive generations, based on ancestor deep neural networks, to obtain increasingly efficient deep neural networks. These 'evolved' deep neural networks (which we will term EvoNets) naturally have more efficient, more varied synapse architectures than their ancestor networks, while achieving powerful modeling capabilities.
Fig. 1. Evolutionary synthesis process of highly-efficient deep neural networks: The architectural genetics of ancestor deep neural networks are encoded in synaptic probability density models P(H_g | W_{g-1}), which can be viewed as the 'DNA' of these ancestor networks. These synaptic probability density models from the ancestor networks are then leveraged, along with computational environmental factor models F(·), to synthesize new descendant deep neural networks with different network architectures in a random manner that mimics natural selection and random mutation. Each generation of 'evolved' networks (EvoNets) can then synthesize the next generation of EvoNets, with architectures that have improved efficiency over their ancestor networks.
II. EVONETS

The proposed evolutionary synthesis of deep neural networks is primarily inspired by real biological evolutionary processes. In nature, heritable traits that are passed down from generation to generation through DNA may change over successive generations due to factors such as natural selection and random mutation, giving rise to diversity and enhanced traits in later generations. To adopt the idea of an evolutionary process for the purpose of obtaining highly-efficient deep neural networks, we introduce a number of fundamental computational constructs to replicate the following key ingredients of evolution: i) genetic encoding (i.e., DNA), ii) natural selection, and iii) random mutation.
A. Genetic Encoding

Here, we replicate the idea of genetic encoding by encapsulating the architectural traits of ancestor deep neural networks in the form of synaptic probability density models. One can then view these synaptic probability density models as the 'DNA' of the ancestor networks. Let P(H_g | H_{g-1}) denote the conditional probability of the architecture of the deep neural network in generation g (denoted by H_g), given the architecture of its ancestor deep neural network in the previous generation (denoted by H_{g-1}). The architecture H_{g-1} is encoded by the set of neurons N and the synapses h_{g-1,k}, where h_{g-1,k} ∈ H_{g-1} is a synapse between two neurons (n_i, n_j) ∈ N. The genetic information of a deep neural network in generation g is encoded in W_g, which characterizes the synaptic strengths of the network; corresponding to each synapse h_{g,k} there exists a w_{g,k} ∈ W_g that encodes its synaptic strength. The synaptic strengths of ancestor deep neural networks are involved in synthesizing the next generation of descendant deep neural networks, where areas of stronger synapses can be viewed as stronger heritable traits passed to the descendant deep neural networks. Therefore, P(H_g | H_{g-1}) can be reformulated as P(H_g | W_{g-1}).
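As an illustration, the architectural 'DNA' of a trained ancestor can be represented concretely as the collection of its per-layer synaptic strengths. The following is a minimal sketch, assuming a PyTorch model and using the magnitude of each learned weight as the synaptic strength; the text above only states that W_{g-1} characterizes synaptic strength, so this particular choice is an assumption.

```python
# Minimal sketch (illustration only): represent the ancestor 'DNA' as the
# per-layer synaptic strengths W_{g-1}, taken here as the magnitudes of the
# learned weights of a trained PyTorch model.
import torch

def extract_dna(model: torch.nn.Module) -> dict:
    """Return {layer name: tensor of synaptic strengths} for every weight tensor."""
    return {name: param.detach().abs().clone()
            for name, param in model.named_parameters()
            if name.endswith("weight")}

# Toy usage with a small stand-in ancestor network.
ancestor = torch.nn.Sequential(torch.nn.Linear(32, 64),
                               torch.nn.ReLU(),
                               torch.nn.Linear(64, 10))
dna = extract_dna(ancestor)
print({name: tuple(w.shape) for name, w in dna.items()})
```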
B. Natural Selection and Random Mutation

The ideas of natural selection and random mutation are replicated through the introduction of a random network synthesis process that synthesizes descendant deep neural networks based on the synaptic probability density models encoding the architectural genetics of the ancestor deep neural networks, as well as an environmental factor model that mimics the environmental conditions driving natural selection. More specifically, a synapse is synthesized randomly between two neurons in a descendant deep neural network based on P(H_g | W_{g-1}) (which replicates genetic inheritance) and imposed environmental constraints F(E). The architecture of a descendant deep neural network at generation g can be synthesized randomly via P(H_g), which can be formulated as:

P(H_g) ≈ F(E) · P(H_g | W_{g-1})    (1)
where the new network architecture is influenced by the heritable genetic information from the ancestor deep neural network, P(H_g | W_{g-1}), and by the environmental factors modeled by F(E). The environmental factors can be a combination of limitations and restrictions imposed upon the descendant deep neural networks, to which they must adapt. To provide better intuition, an illustrative example of how one can impose environmental conditions to promote the evolution of highly efficient deep neural networks is given below.

C. Efficiency-driven Evolutionary Synthesis

Let us assume that an original ancestor network architecture H_1 is available. Network H_1 has been trained, and W_1 encodes the synaptic strengths of the network. Suppose that we place this original ancestor network (the first generation), and thus its descendant networks in subsequent generations as well, within an environment where energy efficiency is an important factor for survival. One of the main environmental factors that encourages energy efficiency during evolution is restricting the available resources. For example, in a study by Moran et al. [16], it was proposed that the eyeless Mexican
cavefish lost its vision system over generations due to the high energetic cost of neural tissue and low food availability in subterranean habitats. Their study demonstrated that the cost of vision is about 15% of resting metabolism for a 1 g eyed phenotype. Since cave animals such as the eyeless Mexican cavefish live in subterranean habitats where food is limited and vision is not particularly useful for survival given the lack of light, losing the vision system through evolution yields significant energy savings, and this evolved efficiency improves survivability. As such, we are motivated by nature to computationally restrict the resources available to descendant deep neural networks in order to encourage the evolution of highly-efficient deep neural networks.

To replicate such environmental constraints when synthesizing descendant deep neural networks, we introduce an environmental constraint C that restricts the quantity of synapses that can be synthesized in the descendant deep neural network. Considering the aforementioned example, the descendant deep neural networks must take on architectures with efficient energy consumption to be able to survive, and the main factor in energy consumption is the quantity of synapses in the network. Therefore, an imposed environmental constraint on the percentage of allowable synapses in the descendant network mimics the natural selection process by forcing the descendant networks to be more efficient than their ancestor networks, via F(E) = C. Given P(H_g | W_{g-1}) and C, the synapse synthesis probability P(H_g) can be formulated as:

P(H_g) = C · ∏_i P(h_{g,i} | w_{g-1,i})    (2)

where C is the percentage of allowable synapses in the descendant deep neural network, and P(h_{g,i} | w_{g-1,i}) can be formulated as the following exponential distribution:

P(h_{g,i} | w_{g-1,i}) = exp( w_{g-1,i} / Z − 1 )    (3)

where Z is a normalization constant that keeps the probability within [0, 1]. The random element of the network synthesis process mimics random mutation and promotes network architectural diversity. The synthesized descendant deep neural networks are then trained into fully-functional networks, as one would train a newborn, and the evolutionary process is repeated to produce subsequent generations of descendant deep neural networks. An overview of the proposed evolutionary process for obtaining highly-efficient deep neural networks is shown in Figure 1. These 'evolved' deep neural networks (EvoNets) have more efficient, more varied architectures than their ancestor deep neural networks due to the natural selection and random mutation processes, while achieving strong modeling power and capabilities.
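To make this synthesis procedure concrete, the following is a minimal sketch under a few stated assumptions: the ancestor 'DNA' is a dictionary of per-layer synaptic-strength tensors (as in the sketch in Section II-A), Z is taken as the largest strength within each layer so that the probability stays in [0, 1], and the constraint C is applied layer-wise; none of these specifics are prescribed by the text.

```python
# Minimal sketch of the synthesis step in Eqs. (2)-(3). The choice of Z and the
# layer-wise application of the constraint C are assumptions for illustration.
import torch

def synapse_probability(w: torch.Tensor) -> torch.Tensor:
    # Eq. (3): P(h_g | w_{g-1}) = exp(w_{g-1}/Z - 1), with Z keeping it in [0, 1].
    z = w.max().clamp_min(1e-12)
    return torch.exp(w / z - 1.0)

def synthesize_descendant(dna: dict, C: float = 0.6) -> dict:
    # Eq. (2): each synapse is realized with probability C * P(h_g | w_{g-1});
    # the random draw mimics random mutation, while C mimics natural selection.
    return {name: torch.rand_like(w) < C * synapse_probability(w)
            for name, w in dna.items()}

# Toy usage: synthesize a descendant synapse mask with C = 0.6, the constraint
# used in the experiments below.
dna = {"fc1.weight": torch.rand(64, 32), "fc2.weight": torch.rand(10, 64)}
masks = synthesize_descendant(dna, C=0.6)
kept = sum(m.sum().item() for m in masks.values()) / sum(m.numel() for m in masks.values())
print(f"fraction of ancestor synapses kept: {kept:.2f}")
```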
III. EXPERIMENTS AND DISCUSSION

To investigate the efficacy of the proposed EvoNet evolutionary synthesis process for evolving highly-efficient deep neural networks, experiments were performed using the MSRA-B dataset [17] for the purpose of image segmentation. Three generations of descendant deep neural networks (second, third, and fourth generations) were synthesized beyond the original, first-generation ancestor deep neural network (see the description of the network architecture in Section III-A). The environmental constraint imposed during synthesis in this study is that a descendant deep neural network should have no more than 60% of the total number of synapses of its direct ancestor network, thus greatly encouraging the evolution of highly-efficient deep neural networks.

Fig. 2. MSRA-B image dataset: This dataset contains 5000 natural images divided into 2500, 500, and 2000 images as training, validation, and test samples, respectively. The ground truth maps are provided with pixel-wise annotation. Examples of images and their corresponding ground truth maps in the MSRA-B image dataset are shown here.

A. Network architecture

The network architecture of the original, first-generation ancestor deep neural network used in this study builds upon the VGG16 very deep convolutional neural network architecture [5] for the purpose of image segmentation as follows. The outputs of the c3, c4, and c5 stacks of VGG16 are fed into newly added c6, c7, and c8 stacks, respectively. The outputs of the c7 and c8 stacks are then fed into the d1 and d2 stacks. The concatenated outputs of the c6, d1, and d2 stacks are then fed into the c9 stack. The output of the c5 stack is fed into the c10 and c11 stacks. Finally, the combined output of the c9, c10, and c11 stacks is fed into a softmax layer to produce the final segmentation result. The details of the different stacks are as follows: c1: 2 convolutional layers of 64, 3×3 local receptive fields; c2: 2 convolutional layers of 128, 3×3 local receptive fields; c3: 3 convolutional layers of 256, 3×3 local receptive fields; c4: 3 convolutional layers of 512, 3×3 local receptive fields; c5: 3 convolutional layers of 512, 3×3 local receptive fields; c6: 1 convolutional layer of 256, 3×3 local receptive fields; c7 and c8: 1 convolutional layer of 512, 3×3 local receptive fields each; c9: 1 convolutional layer of 384, 1×1 local receptive fields; c10 and c11: 2 convolutional layers of 512, 11×11 local receptive fields and 384, 1×1 local receptive fields each; d1 and d2: deconvolutional layers.
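To make the stack connectivity concrete, the following is a rough PyTorch sketch of this wiring; the VGG16 slice points, upsampling factors, padding choices, and the use of resizing to align feature maps before concatenation are all assumptions rather than details specified above.

```python
# Rough sketch (assumptions noted above) of the first-generation ancestor network:
# new stacks c6-c11 and deconvolution stacks d1/d2 are added on top of the
# c3/c4/c5 outputs of VGG16 for two-class (salient vs. background) segmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

def conv(in_ch, out_ch, k):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2), nn.ReLU(inplace=True))

class AncestorNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        feats = vgg16(weights=None).features
        self.to_c3 = feats[:16]       # c1-c3 (256 channels out; slice point assumed)
        self.c4 = feats[16:23]        # c4 (512 channels out)
        self.c5 = feats[23:30]        # c5 (512 channels out)
        self.c6 = conv(256, 256, 3)   # on the c3 output
        self.c7 = conv(512, 512, 3)   # on the c4 output
        self.c8 = conv(512, 512, 3)   # on the c5 output
        self.d1 = nn.ConvTranspose2d(512, 512, 4, stride=2, padding=1)  # upsample c7 output
        self.d2 = nn.ConvTranspose2d(512, 512, 8, stride=4, padding=2)  # upsample c8 output
        self.c9 = conv(256 + 512 + 512, 384, 1)
        self.c10 = nn.Sequential(conv(512, 512, 11), conv(512, 384, 1))
        self.c11 = nn.Sequential(conv(512, 512, 11), conv(512, 384, 1))
        self.head = nn.Conv2d(3 * 384, num_classes, 1)

    def forward(self, x):
        c3 = self.to_c3(x)
        c4 = self.c4(c3)
        c5 = self.c5(c4)
        size = c3.shape[2:]  # align every branch to the c3 resolution before concatenating
        d1 = F.interpolate(self.d1(self.c7(c4)), size=size, mode="bilinear", align_corners=False)
        d2 = F.interpolate(self.d2(self.c8(c5)), size=size, mode="bilinear", align_corners=False)
        c9 = self.c9(torch.cat([self.c6(c3), d1, d2], dim=1))
        c10 = F.interpolate(self.c10(c5), size=size, mode="bilinear", align_corners=False)
        c11 = F.interpolate(self.c11(c5), size=size, mode="bilinear", align_corners=False)
        return F.softmax(self.head(torch.cat([c9, c10, c11], dim=1)), dim=1)

# Toy forward pass on a 224x224 image.
print(AncestorNet()(torch.randn(1, 3, 224, 224)).shape)
```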
TABLE I
PERFORMANCE METRICS FOR DIFFERENT GENERATIONS OF EVONETS

Method      Number of synapses   Synapse efficiency   Precision   Recall   Fβ score   MAE
MDF [18]    —                    —                    0.864       0.870    0.865      0.110
EvoNet G1   63,767,232           1X                   0.871       0.846    0.867      0.065
EvoNet G2   16,126,732           4X                   0.891       0.815    0.872      0.064
EvoNet G3   6,517,011            9.8X                 0.879       0.801    0.859      0.069
EvoNet G4   3,322,272            19.2X                0.839       0.841    0.839      0.076
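As a quick sanity check on the table, the synapse-efficiency column is simply the ratio of the first-generation synapse count to each generation's synapse count:

```python
# Synapse efficiency relative to the first-generation ancestor (counts from Table I).
synapses = {"G1": 63767232, "G2": 16126732, "G3": 6517011, "G4": 3322272}
for gen, n in synapses.items():
    print(f"EvoNet {gen}: {synapses['G1'] / n:.1f}X")  # 1.0X, 4.0X, 9.8X, 19.2X
```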
B. Dataset

The dataset used in this study is the MSRA-B dataset [17], which consists of 5000 natural images and their corresponding ground truth maps, where the salient objects in the images are segmented with pixel-wise annotation. The dataset is divided into training, validation, and testing groups containing 2500, 500, and 2000 images, respectively. Figure 2 illustrates some example images from the dataset along with their corresponding ground truths.

C. Experimental results

To evaluate the performance of the evolved descendant deep neural networks at different generations, the MAE, Fβ score (where β² = 0.3 [18]), precision, and recall metrics were computed for each of the descendant deep neural networks across the 2000 test images of the MSRA-B dataset that were not used for training (a minimal sketch of these metrics is given at the end of this subsection). As a reference, the same performance metrics were also computed for the original, first-generation ancestor deep neural network, as well as for the state-of-the-art multiscale deep feature-based method proposed by Li et al. [18]. Table I shows the performance metrics of the descendant deep neural networks (second, third, and fourth generations), the original, first-generation ancestor deep neural network, and the method by Li et al. [18].

A number of observations can be made based on the experimental results shown in Table I. First, it can be observed that the performance differences from one generation of evolved descendant deep neural networks to the next are small, which indicates that the modeling power of the ancestor deep neural networks is largely preserved in the descendant deep neural networks. Furthermore, it can be observed that the synthesized EvoNets can achieve a state-of-the-art Fβ score (0.872 at the second generation, 0.859 at the third generation, and 0.839 at the fourth generation, compared to 0.865 as reported by Li et al. [18] for the state-of-the-art multiscale deep feature-based method) while having synapse architectures that are significantly more efficient (∼19X fewer synapses by the fourth generation) compared to the original, first-generation ancestor deep neural network. This nineteen-fold increase in architectural efficiency while maintaining modeling power clearly shows the efficacy of obtaining highly-efficient deep neural networks via the proposed EvoNet evolutionary synthesis process.
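As referenced above, here is a minimal sketch of the evaluation metrics, assuming pred is a predicted saliency map with values in [0, 1] and gt is a binary ground-truth map; the 0.5 binarization threshold is an assumption, as the binarization procedure is not specified here.

```python
# Minimal sketch of the evaluation metrics: MAE and the F_beta score with
# beta^2 = 0.3 [18]. The 0.5 binarization threshold is an assumption.
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - gt)))

def f_beta(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3,
           threshold: float = 0.5, eps: float = 1e-12) -> float:
    binary = pred >= threshold
    tp = np.logical_and(binary, gt > 0).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / ((gt > 0).sum() + eps)
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall + eps))

# Toy usage with random maps standing in for real predictions and ground truth.
rng = np.random.default_rng(0)
pred = rng.random((64, 64))
gt = (rng.random((64, 64)) > 0.5).astype(np.float32)
print(mae(pred, gt), f_beta(pred, gt))
```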
ACKNOWLEDGMENT

This work was supported by the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chairs Program, and the Ontario Ministry of Research and Innovation. The authors also thank Nvidia for the GPU hardware used in this study through the Nvidia Hardware Grant Program.

AUTHOR CONTRIBUTIONS

A.W. conceived the concept of evolutionary synthesis for deep learning proposed in this study. M.S. and A.W. formulated the evolutionary synthesis process proposed in this study. A.M. implemented the evolutionary synthesis process and performed all experiments in this study. A.W., M.S., and A.M. all participated in writing this paper.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[3] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.
[4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), 2015.
[5] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, 2012.
[7] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[8] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," arXiv preprint arXiv:1512.02595, 2015.
[9] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2368–2376.
[10] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[11] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in Advances in Neural Information Processing Systems (NIPS), 1989.
[12] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," arXiv preprint arXiv:1412.6115, 2014.
[13] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[14] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems (NIPS), 2015.
[15] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," arXiv preprint arXiv:1504.04788, 2015.
[16] D. Moran, R. Softley, and E. J. Warrant, "The energetic cost of vision and the evolution of eyeless Mexican cavefish," Science Advances, 2015.
[17] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, "Learning to detect a salient object," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[18] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.