Towards an Optimal Implementation of MLP in FPGA

E.M. Ortigosa¹, A. Cañas¹, R. Rodríguez¹, J. Díaz¹, and S. Mota²

¹ Dept. of Computer Architecture and Technology, University of Granada, E-18071 Granada, Spain
{eva, acanas, rrodriguez, jdiaz}@atc.ugr.es
² Dept. of Informatics and Numerical Analysis, University of Córdoba, E-14071 Córdoba, Spain
[email protected]

Abstract. We present the hardware implementation of a partially connected neural network defined as an extension of the Multi-Layer Perceptron (MLP) model. We demonstrate that partially connected neural networks lead to higher performance in terms of computing speed, requiring less memory and fewer computing resources. This work presents a complete study comparing the hardware implementations of the MLP and of a partially connected version (XMLP) in terms of computing speed, hardware resources and performance cost. Furthermore, we also study different memory management strategies for the connectivity patterns.

1 Introduction

The implementation of Artificial Neural Networks (ANNs) as embedded systems using FPGA devices has become an interesting research field in recent years [1,2,3,4,5]. The work presented here studies the viability and efficiency of implementing ANNs on reconfigurable hardware (FPGA) for embedded systems, such as portable real-time ASR (Automatic Speech Recognition) devices for consumer applications, vehicle equipment (GPS navigator interfaces), toys, aids for disabled persons, etc. The application chosen is ASR and, among the different ANN models used for ASR, we have focused on the Multi-Layer Perceptron (MLP).

The hardware implementation of ANNs usually starts from the conventional software model and tries to parallelize the whole processing scheme. Nevertheless, optimizing the traditional ANN model towards a less densely connected network leads to a significant improvement in computing speed, requiring less memory and fewer computing resources. Accordingly, the main innovation of this contribution is the description of a modified version of the MLP, called the eXtended Multi-Layer Perceptron (XMLP) [6], with reduced connectivity.

The paper is organized as follows. Section 2 briefly describes the MLP and XMLP models. We then describe and evaluate the detailed hardware implementation strategies of the MLP and XMLP and discuss the results (Section 3). Finally, Section 4 summarizes the conclusions.

2 MLP / XMLP

2.1 Multi-Layer Perceptron (MLP)

The MLP is an ANN with processing elements or neurons organized in a regular structure with several layers: an input layer (which is simply an input vector), some hidden layers and an output layer. For classification problems, only one winning node of the output layer is active for each input pattern. Each layer is fully connected with its adjacent layers. There are no connections between non-adjacent layers and no recurrent connections. Each of these connections is defined by an associated weight. Each neuron calculates the weighted sum of its inputs and applies an activation function that forces the neuron output to be high or low. In this way, by propagating the output of each layer, the MLP generates an output vector from each input pattern. The synaptic weights are adjusted through a supervised training algorithm called backpropagation [7]. The most frequently used activation function is the sigmoid, although there are other choices such as a ramp function, a hyperbolic tangent, etc.
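As a point of reference, the following sketch (floating-point and illustrative only; it is not the paper's code, and the hardware versions described later use fixed-point arithmetic) shows the computation performed by one fully connected MLP layer: a weighted sum per neuron followed by a sigmoid activation.

```c
#include <math.h>

/* Sigmoid activation used by most MLPs. */
static double sigmoid(double x)
{
    return 1.0 / (1.0 + exp(-x));
}

/* Forward pass of one fully connected layer.
 * in : n_in outputs of the previous layer
 * w  : n_out x n_in weight matrix, stored row-major
 * out: n_out outputs of this layer */
void mlp_layer_forward(const double *in, const double *w,
                       double *out, int n_in, int n_out)
{
    for (int j = 0; j < n_out; ++j) {
        double acc = 0.0;
        for (int i = 0; i < n_in; ++i)
            acc += w[j * n_in + i] * in[i];   /* weighted sum of inputs   */
        out[j] = sigmoid(acc);                /* activation to high / low */
    }
}
```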

2.2 Extended Multi-Layer Perceptron (XMLP)

The XMLP is a feed-forward neural network with an input layer (without neurons), a number of hidden layers selectable from zero to two, and an output layer. In addition to the usual MLP connectivity, any layer can be two-dimensional and partially connected to adjacent layers. As illustrated in Fig. 1, connections come out from each layer in overlapped rectangular groups. The size of a layer l and its partial connectivity pattern are defined by six parameters in the following form: x(gx, sx) × y(gy, sy), where x and y are the sizes of the axes, and g and s specify the size of a group of neurons and the step between two consecutive groups, both in abscissas (gx, sx) and ordinates (gy, sy). A neuron i in the X-axis at layer l+1 (the upper one in Fig. 1) is fed from all the neurons belonging to the i-th group in the X-axis at layer l (the lower one). The same connectivity definition is used in the Y-axis. When g and s are not specified for a particular dimension, the connectivity assumed for that dimension is gx = x and sx = 0, or gy = y and sy = 0. Thus, the MLP is a particular case of the XMLP where gx = x, sx = 0, gy = y and sy = 0 for all layers.

Fig. 1. Structure of an XMLP layer and its connections to the next layer
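To make the grouping rule concrete, the sketch below (illustrative code, not from the paper) enumerates, along one axis, which lower-layer neurons feed each upper-layer neuron, under the assumption that group i starts at neuron i·s and spans g consecutive neurons. For the 10(4,2) axis used in Section 3 this yields four overlapping groups of four neurons, matching the four-neuron X-axis of the hidden layer.

```c
#include <stdio.h>

/* List the lower-layer neurons feeding each upper-layer neuron along one
 * axis of size x with group size g and step s (assumed group layout:
 * group i covers indices i*s .. i*s + g - 1). */
void print_x_groups(int x, int g, int s)
{
    if (s == 0) {                       /* default case: fully connected axis */
        printf("every upper neuron <- lower neurons 0..%d\n", x - 1);
        return;
    }
    int n_groups = (x - g) / s + 1;     /* upper-layer size along this axis   */
    for (int i = 0; i < n_groups; ++i)
        printf("upper neuron %d <- lower neurons %d..%d\n",
               i, i * s, i * s + g - 1);
}

int main(void)
{
    print_x_groups(10, 4, 2);           /* the 10(4,2) input axis of Section 3 */
    return 0;
}
```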

3 Hardware Implementation: XMLP vs MLP

In order to illustrate the hardware implementation of the MLP and XMLP systems, and to carry out a complete comparative study of these ANNs, we have chosen a specific speaker-independent isolated word recognition application [8]. Nevertheless, many other applications also require embedded systems in portable devices (low cost, low power and reduced physical size). For our test bed application, we need an MLP / XMLP with 220 scalar data in the input layer and 10 nodes in the output layer. The network input consists of 10 vectors of 22 components (10 cepstrum, 10 Δcepstrum, energy, Δenergy) obtained after preprocessing the speech signal. The output nodes correspond to 10 recognizable words extracted from a multi-speaker database [9]. After testing different architectures [6], the best classification results (96.83% correct classification rate in a speaker-independent scheme) have been obtained using 24 nodes in a single hidden layer, with the connectivity of the XMLP defined by 10(4,2)×22 in the input layer and 4×6 in the hidden layer.

For the MLP / XMLP implementations, we have chosen fixed-point computation with two's complement representation and different bit depths for the stored data (inputs, weights, activation function, outputs, etc.). It is necessary to limit the range of the different variables: the inputs to the MLP / XMLP and the output of the activation function (8 bits), the weights (8 bits), and the inputs to the activation function, which is defined by a Look-Up Table (LUT) that stores the useful values. After applying all these discretization simplifications, the model achieves similar classification results: the results of the hardware system differ by less than 1% from the full-resolution software results.

The systems have been designed and translated to EDIF files using the DK Design Suite tool from Celoxica [10]. Then, the designs have been placed and routed on a Virtex-E 2000 FPGA using the ISE Foundation 3.5i development environment from Xilinx [11]. We have evaluated serial and parallel approaches. The serial version emulates the software implementation, using only one processing unit that is multiplexed to compute all the neurons of the ANN. The parallel version, on the other hand, uses one processing unit per neuron of the hidden layer. In addition, we have implemented three versions of each approach using different memory resources: (A) using only distributed memory resources, (B) using distributed memory resources and embedded memory blocks, and (C) using only embedded memory blocks. Tables 1 and 2 summarize the hardware implementation characteristics of the MLP and XMLP respectively.
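As an illustration of this discretization scheme, the sketch below evaluates one neuron with 8-bit two's-complement inputs and weights and a LUT-based sigmoid. The LUT depth, the accumulator-to-index scaling and the 8-bit output format are assumptions made for the sake of the example; the paper does not specify these details.

```c
#include <stdint.h>
#include <math.h>

#define LUT_BITS 10                              /* assumed LUT depth: 1024 entries */
static int8_t sigmoid_lut[1 << LUT_BITS];

/* Fill the LUT offline-style: map each table index back to a real-valued
 * activation input (in_scale = assumed real value of one LUT step) and
 * quantise sigmoid(x) to 8 bits (0..127, an assumed output format). */
void init_sigmoid_lut(double in_scale)
{
    for (int i = 0; i < (1 << LUT_BITS); ++i) {
        double x = (i - (1 << (LUT_BITS - 1))) * in_scale;
        sigmoid_lut[i] = (int8_t)lrint(127.0 / (1.0 + exp(-x)));
    }
}

/* One neuron: 8-bit inputs and weights, wide accumulator, LUT activation. */
int8_t neuron_fixed(const int8_t *in, const int8_t *w, int n)
{
    int32_t acc = 0;                             /* wide accumulator avoids overflow */
    for (int i = 0; i < n; ++i)
        acc += (int32_t)in[i] * (int32_t)w[i];

    /* Keep only the most significant bits as the table address and clamp
     * (the >>8 scaling is an assumption, not the paper's exact choice). */
    int32_t idx = (acc >> 8) + (1 << (LUT_BITS - 1));
    if (idx < 0)                     idx = 0;
    if (idx > (1 << LUT_BITS) - 1)   idx = (1 << LUT_BITS) - 1;
    return sigmoid_lut[idx];
}
```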


Table 1. Implementation characteristics of the MLP designs. (A) Only distributed RAM. (B) Both EMB RAM and distributed RAM. (C) Only EMB RAM.

MLP design    # slices   # EMBs RAM   # sys. gates (Sg)   Max. Freq. (MHz)   # cycles   Dt (data/s)   Pc = Sg/Dt
A: Serial         2582            0              245268               19.7       5588          3535        69.38
A: Parallel       6321            0              333828               17.2        282         60968         5.48
B: Serial          710           11              218712               16.1       5588          2671        81.86
B: Parallel       4411           24              518928               16.7        282         59326         8.75
C: Serial          547           14              248996               15.3       5588          2733        91.07
C: Parallel       4270           36              695380               15.4        282         54692        12.71

Table 2. Implementation characteristics of the XMLP designs. (A) Only distributed RAM. (B) Both EMB RAM and distributed RAM. (C) Only EMB RAM.

XMLP design   # slices   # EMBs RAM   # sys. gates (Sg)   Max. Freq. (MHz)   # cycles   Dt (data/s)   Pc = Sg/Dt
A: Serial         2389            0              181858               22.3       2595          8592        21.17
A: Parallel       5754            0              258808               20.8        167        124844         2.07
B: Serial         1700            5              154806               14.0       2595          5384        28.75
B: Parallel       5032           96             1722376               15.6        167         93170        18.49
C: Serial         1608           10              202022               13.0       2595          4990        40.48
C: Parallel       4923          147             2526915               15.4        167         92365        27.36

The tables include the hardware resource consumption of each approach in terms of slices and embedded memory blocks (EMBs). We also calculate a global hardware resource indicator as the total number of equivalent system gates (Sg) consumed by each implemented version. The gate counting used on Virtex-E devices is consistent with the system gate counting used on the original Virtex devices [11]. In order to evaluate the computing speed, we include the maximum clock frequency allowed by each implementation (given by ISE Foundation after the synthesis stage) and the number of clock cycles consumed to evaluate a single input vector. We have also calculated a global performance indicator as the data throughput (Dt) of the system, i.e. the number of input vectors processed per second (note that each vector consists of 220 components). Finally, to better illustrate the trade-off between these two characteristics, which can be adopted as a convenient design objective during the architecture definition process, we have evaluated the performance cost (Pc) achieved by each implementation as the ratio Pc = Sg / Dt. This figure indicates the hardware resources required to achieve a data throughput of one input vector per second, so the different approaches can be compared directly on the hardware resources they need to reach the same performance. The comparative results are summarized in Fig. 2.
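As a worked check of these indicators, take the "A: Serial" MLP row of Table 1; the small difference from the tabulated Dt of 3535 data/s presumably comes from rounding of the reported clock frequency:

```latex
D_t = \frac{f_{\max}}{\#\,\text{cycles}}
    \approx \frac{19.7 \times 10^{6}\,\text{Hz}}{5588}
    \approx 3.5 \times 10^{3}\ \text{data/s},
\qquad
P_c = \frac{S_g}{D_t} = \frac{245268}{3535} \approx 69.4
```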


Fig. 2. XMLP vs. MLP in the serial (left) and parallel (right) approaches: (a) number of equivalent system gates, Sg; (b) data throughput, Dt; (c) performance cost, Pc.

We can see that the XMLP significantly improves all the indicators (hardware resource consumption Sg, data throughput Dt and performance cost Pc) compared to the MLP. In both the serial and the parallel versions, the XMLP improves all the indicators by roughly a factor of two. Among the three versions evaluated regarding the data storage strategy (A, B and C), the best approach is the one that only makes use of distributed memory resources (A). This is because the memory requirement per neuron (220 input weights for each hidden neuron) is small compared to the size of the embedded memory blocks (512 elements), so the embedded memory resources are used inefficiently (although this factor depends on the network topology, mainly on the hidden neuron fan-in). This effect is more dramatic in the XMLP, where the number of inter-neuron connections is greatly reduced. In this case, the use of only distributed memory is mandatory, since the embedded memory blocks are very inefficiently used given the reduced number of weights stored in each of them, because the parallel version needs to access the weights of different hidden neurons in parallel.
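As a rough illustration of this utilisation argument, assume that one embedded block is devoted to one hidden neuron's weights and that the 10(4,2)×22 connectivity gives each XMLP hidden neuron 4 × 22 = 88 input weights (an inference from Section 3, not a figure stated in the paper). The fraction of each 512-element block actually holding weights would then be:

```latex
\frac{220}{512} \approx 43\%\ \text{(MLP)}
\qquad\text{vs.}\qquad
\frac{4 \times 22}{512} = \frac{88}{512} \approx 17\%\ \text{(XMLP)}
```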

4 Conclusions

The main innovative idea of this contribution regarding the hardware implementation of neural networks is that the final design can be greatly improved if the preliminary software simulations include the possibility of simplifying the network topology (reducing the number of inter-neuron connections). Along this line, we have defined an extended multilayer perceptron that allows the definition of networks with different interlayer topologies. This aspect has a great impact on the implementation of the neural network on specific hardware, as has been shown in Section 3. Choosing between different topologies can be done in preliminary software simulations, applying, for instance, genetic algorithms to obtain the best configuration before the hardware implementation is addressed.

Acknowledgement. This work has been supported by the TIN2004-01419 and SENSOPAC projects.

References

1. Zhu, J., Sutton, P.: FPGA Implementations of Neural Networks – a Survey of a Decade of Progress. LNCS 2778 (2003) 1062–1066
2. Zhu, J., Milne, G., Gunther, B.: Towards an FPGA Based Reconfigurable Computing Environment for Neural Network Implementations. In: Proc. 9th Intl. Conf. on ANN, Vol. 2 (1999) 661–667
3. Gonçalves, R.A., Moraes, P.A., et al.: ARCHITECT-R: A System for Reconfigurable Robots Design. In: ACM Symp. on Applied Computing, ACM Press (2003) 679–683
4. Hammerstrom, D.: Digital VLSI for Neural Networks. In: The Handbook of Brain Theory and Neural Networks, 2nd edn., Arbib, M. (ed.), MIT Press (2003)
5. Gao, C., Hammerstrom, D., Zhu, S., Butts, M.: FPGA Implementation of Very Large Associative Memories. In: FPGA Implementations of Neural Networks, Omondi, A.R., Rajapakse, J.C. (eds.), Springer, Berlin Heidelberg New York (2005)
6. Cañas, A., Ortigosa, E.M., Díaz, A., Ortega, J.: XMLP: A Feed-Forward Neural Network with Two-Dimensional Layers and Partial Connectivity. LNCS 2687 (2003) 89–96
7. Widrow, B., Lehr, M.: 30 Years of Adaptive Neural Networks: Perceptron, Madaline and Backpropagation. Proc. of the IEEE 78(9) (1990) 1415–1442
8. Ortigosa, E.M., Cañas, A., Ros, E., Carrillo, R.R.: FPGA Implementation of a Perceptron-Like Neural Network for Embedded Applications. LNCS 2687 (2003) 1–8
9. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.: Phoneme Recognition Using Time-Delay Neural Networks. IEEE Trans. on Acoustics, Speech and Signal Processing 37(3) (1989)
10. Celoxica: Technical Library, http://www.celoxica.com/techlib/
11. Xilinx, http://www.xilinx.com/