CONTRIBUTED ARTICLE

Learning in Stochastic Bit Stream Neural Networks

JIEYU ZHAO, JOHN SHAWE-TAYLOR AND MAX VAN DAALEN
Department of Computer Science, Royal Holloway, University of London

(Received 20 November 1994; accepted 15 May 1995)

Abstract: This paper presents learning techniques for a novel feedforward stochastic neural network. The model uses stochastic weights and the `bit stream' data representation. It has a clean, analysable functionality and is very attractive because of its great potential to be implemented in hardware using standard digital VLSI technology. The design allows simulation at three different levels, and learning techniques are described for each level. The lowest level corresponds to on-chip learning. Simulation results on the three benchmark MONK's problems and on handwritten digit recognition with a clean set of 500 16×16 pixel digits demonstrate that the new model is powerful enough for real-world applications.

Keywords: Stochastic Computing, Bit Stream, Stochastic Neural Networks, Gradient Descent Learning.

1. INTRODUCTION

The stochastic neural network is a very promising model for global optimization because of its ability to escape from local minima. In a stochastic neural network the information is usually fuzzy and noisy, and each neuron does not have to make very fine discriminations on its inputs. The output of the network should not be affected by a single neuron but should be determined only by the global state of the network. This can be achieved by building a totally distributed representation into the network. The distributed representation makes the network very robust in a noisy environment and improves the generalisation performance (Rumelhart et al., 1986). This consideration leads to a natural data representation with `bit streams' in which real values are represented by modulating the frequency of 1's and 0's. This is the central idea of `stochastic computing' (Gains, 1969). Similar techniques of encoding information in the time domain, like the `pulse stream' technique (Murray & Tarassenko, 1994) and `pulse-density modulation' (Tomberg & Kaski, 1990), have already been successfully used in the implementation of standard sigmoid neurons. Recently the idea of the `bit stream' representation has been used to develop a stochastic neuron (Shawe-Taylor et al., 1991). The design is attractive because it is fully digital, yet it obviates the need for large areas of silicon to implement the neural functionality.

* Currently with IDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland.

Indeed, a single neuron occupies no more space than a simple digital counter. The result is a very fast and compact network with a clean, analysable functionality, which also means that, unlike with analogue implementations, large networks can be constructed without accumulation of noise error. The effect of the simplicity of the design is a slight deviation from the functionality of the standard sigmoid neuron. The limitation of the weight values forces the network to form a totally distributed representation during the learning process. The purpose of this paper is to present learning techniques at three levels of description of the bit stream neuron (BSN) and corresponding simulations that demonstrate the following points:

- Successful training shows that the functionality of a BSN network is comparable with that of a standard neural network;
- Adaptive techniques can be successfully applied in large scale simulations for real world applications;
- Adaptive techniques applied at an intermediate level of description show how BSN networks can be successfully trained to solve standard problems, to which they give interesting and pleasing solutions;
- A bit stream implementation of the derivative calculation is demonstrated. This points the way to a hardware implementation of the backpropagation algorithm.

2. DEFINITIONS

The description of the functionality of the bit stream neuron will first be presented at the bit stream level. Following this, two further levels of description will be presented which allow for progressively more efficient simulation of the functionality at the expense of some of the finer detail.

2.1. Bit Stream Level Description

All signals processed by bit stream neurons are real values represented by `stochastic bit streams'. A `stochastic bit stream' is simply a random binary sequence whose elements take the value 1 or 0. It can be used to represent an analog quantity in various ways. A common method is a linear mapping from the quantity to the probability of a bit in the stream being 1. We call such a bit stream a Bernoulli sequence. The generation of digital bit streams modulated to convey arbitrary probability values is described by Jeavons et al. (1994). In the unipolar representation a real value r ∈ [0, 1] is represented by a Bernoulli sequence with probability p = r. To represent a real value in [-1, 1] with the unipolar representation an extra sign bit is needed. In the bipolar representation a real value r ∈ [-1, 1] is represented by a Bernoulli sequence with probability p = (r + 1)/2. For the unipolar representation only a simple AND gate is needed to carry out the multiplication operation, and for the bipolar representation multiplication can again be easily implemented by an EXCLUSIVE-NOR gate. A stochastic bit stream neuron is a processing unit which carries out very simple operations over its input bit streams. All input bit streams are combined with their corresponding weight bit streams and then the weighted bits are summed up. The final total is compared to a threshold value. If the sum is not less than the threshold the neuron gives an output 1, otherwise 0. There are two different versions of the stochastic bit stream neuron according to the different data representations. We will refer to the version using the bipolar representation with values in the range [-1, 1] by the term XNOR-BSN, while the version using the unipolar representation will be referred to by the term AND-BSN. The XNOR-BSN and the AND-BSN have quite different features. It seems that the AND-BSN is more suitable for feedforward and recurrent networks which involve learning, while the XNOR-BSN is good for simulated annealing in a recurrent network (Zhao & Shawe-Taylor, 1995).
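As an illustration of this representation, the following short Python sketch (illustrative code, not part of the original paper; the helper names are invented) encodes real values as Bernoulli bit streams, multiplies them with an AND gate in the unipolar case and an XNOR gate in the bipolar case, and decodes the results as frequencies of 1s.

    import random

    def unipolar_stream(r, length):
        # Bernoulli sequence for r in [0, 1]: P(bit = 1) = r
        return [1 if random.random() < r else 0 for _ in range(length)]

    def bipolar_stream(r, length):
        # Bernoulli sequence for r in [-1, 1]: P(bit = 1) = (r + 1) / 2
        return unipolar_stream((r + 1) / 2, length)

    def decode_unipolar(bits):
        return sum(bits) / len(bits)

    def decode_bipolar(bits):
        return 2 * sum(bits) / len(bits) - 1

    random.seed(0)
    N = 100_000
    a, b = unipolar_stream(0.3, N), unipolar_stream(0.6, N)
    # unipolar multiplication with an AND gate: expect about 0.3 * 0.6 = 0.18
    print(decode_unipolar([x & y for x, y in zip(a, b)]))
    c, d = bipolar_stream(-0.4, N), bipolar_stream(0.5, N)
    # bipolar multiplication with an XNOR gate: expect about -0.4 * 0.5 = -0.2
    print(decode_bipolar([1 - (x ^ y) for x, y in zip(c, d)]))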

Definition (AND-BSN): An n-input AND-version stochastic bit stream neuron has n weights in the range [-1, 1] and n inputs in the range [0, 1], which are all represented as unipolar Bernoulli sequences. An extra sign bit is attached to each weight Bernoulli sequence. The threshold θ is an integer lying between -n and n which is randomly generated according to the threshold probability density function φ(θ). The computation performed during each operational cycle is:

(1) combine one bit from each of the n input Bernoulli sequences with one bit from the corresponding n weight Bernoulli sequences using the AND operation;

(2) sum up the bits resulting from the n AND operations with their corresponding signs and then compare the total with the randomly generated threshold value. If the total is not less than the threshold value, the AND-BSN outputs 1, otherwise it outputs 0.
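A direct bit-stream-level simulation of this definition might look like the sketch below (illustrative Python with invented names; a deterministic threshold is used, as assumed later in Section 3). Averaging the output bits over many operational cycles estimates the probability that the neuron outputs 1.

    import random

    def and_bsn_cycle(weights, inputs, theta):
        # One operational cycle of an n-input AND-BSN.
        # weights: w_i in [-1, 1], carried as a sign bit plus a stream for |w_i|;
        # inputs: v_i in [0, 1], carried as unipolar Bernoulli streams.
        total = 0
        for w, v in zip(weights, inputs):
            weight_bit = random.random() < abs(w)   # one bit of the |w_i| stream
            input_bit = random.random() < v         # one bit of the v_i stream
            bit = int(weight_bit and input_bit)     # AND combination
            total += bit if w >= 0 else -bit        # apply the sign bit
        return 1 if total >= theta else 0           # compare with the threshold

    random.seed(1)
    w = [0.8, -0.5, 0.3, 0.9]
    v = [0.6, 0.7, 0.2, 0.5]
    theta, cycles = 1, 200_000
    p_out = sum(and_bsn_cycle(w, v, theta) for _ in range(cycles)) / cycles
    print(f"estimated P(output = 1) ~ {p_out:.3f}")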

2.2. Moment Generating Function Description

To understand the behaviour of the stochastic bit stream neuron, a fundamental analysis of its activation function is crucial. The natural way to describe the functionality of a BSN is at the probability level. The activation function of a BSN can be computed under the assumption that the input and weight bits are all independently generated. For every weighted input line i we represent its net contribution by a random variable X_i, which has the following probability density function:

f_i(x) =
\begin{cases}
(|w_i v_i| + w_i v_i)/2, & x = 1 \\
(|w_i v_i| - w_i v_i)/2, & x = -1 \\
1 - |w_i v_i|, & x = 0 \\
0, & \text{otherwise}
\end{cases}

where w_i ∈ [-1, 1] is the corresponding weight value and v_i ∈ [0, 1] the input value. Supposing the number of inputs is n and all the X_i are independent random variables, the moment generating function for the total weighted input contribution to the neuron, X = \sum_{i=1}^{n} X_i, is

M_X(t) = \prod_{i=1}^{n} \left( \frac{|w_i v_i| - w_i v_i}{2} e^{-t} + (1 - |w_i v_i|) + \frac{|w_i v_i| + w_i v_i}{2} e^{t} \right)

The expectation of the sum is given by

E(X) = M_X'(0) = \sum_{i=1}^{n} w_i v_i

and the variance is

V(X) = E(X^2) - E(X)^2 = M_X''(0) - M_X'(0)^2
     = \left( \sum_{i=1}^{n} \left( \frac{|w_i v_i| + w_i v_i}{2} + \frac{|w_i v_i| - w_i v_i}{2} \right) + \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_i v_i w_j v_j \right) - \left( \sum_{i=1}^{n} w_i v_i \right)^2
     = \sum_{i=1}^{n} \left( |w_i v_i| - w_i^2 v_i^2 \right)
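These two moments are easy to check numerically. The sketch below (illustrative code, not from the paper) samples the signed contributions X_i directly from the density f_i and compares the sample mean and variance with \sum w_i v_i and \sum (|w_i v_i| - w_i^2 v_i^2).

    import random

    def sample_contribution(w, v):
        # Draw X_i from f_i: +1 w.p. (|wv|+wv)/2, -1 w.p. (|wv|-wv)/2, else 0.
        p = abs(w * v)
        u = random.random()
        if u < (p + w * v) / 2:
            return 1
        if u < p:
            return -1
        return 0

    random.seed(2)
    w = [0.8, -0.5, 0.3, 0.9, -0.2]
    v = [0.6, 0.7, 0.2, 0.5, 0.9]
    samples = [sum(sample_contribution(wi, vi) for wi, vi in zip(w, v))
               for _ in range(200_000)]
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    print("E(X):", mean, "theory:", sum(wi * vi for wi, vi in zip(w, v)))
    print("V(X):", var, "theory:",
          sum(abs(wi * vi) - (wi * vi) ** 2 for wi, vi in zip(w, v)))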

The output of a neuron will be 1 if the total net contribution is not less than the threshold θ. Here the threshold is an integer in the range -n to n, chosen according to the probability density function φ(θ). We denote by g_i(W, V) the coefficient of e^{it} in M_X(t), where W = (w_1, w_2, ..., w_n) is the weight vector and V = (v_1, v_2, ..., v_n) is the corresponding input vector, such that

M_X(t) = \sum_{i=-n}^{n} g_i(W, V) e^{it}

We have the probability that a BSN outputs 1 as follows:

P_o = \sum_{\theta=-n}^{n} \varphi(\theta) \left( \sum_{i=\theta}^{n} g_i(W, V) \right)
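The coefficients g_i(W, V) can be computed exactly by convolving the n three-point distributions of the X_i. The following sketch (illustrative, standard-library Python) does this and sums the tail of the resulting distribution to obtain P_o for a deterministic threshold θ; the bit-level simulation of Section 2.1 converges to this value.

    def weighted_input_pmf(w, v):
        # Three-point pmf of X_i over {-1, 0, +1}.
        p = abs(w * v)
        return {-1: (p - w * v) / 2, 0: 1.0 - p, 1: (p + w * v) / 2}

    def sum_pmf(weights, inputs):
        # Convolve the X_i pmfs; the result maps k to g_k(W, V) = P(X = k).
        dist = {0: 1.0}
        for w, v in zip(weights, inputs):
            step = weighted_input_pmf(w, v)
            new = {}
            for k, pk in dist.items():
                for d, pd in step.items():
                    new[k + d] = new.get(k + d, 0.0) + pk * pd
            dist = new
        return dist

    def activation(weights, inputs, theta):
        # P_o = P(X >= theta) for a deterministic threshold.
        return sum(p for k, p in sum_pmf(weights, inputs).items() if k >= theta)

    w = [0.8, -0.5, 0.3, 0.9]
    v = [0.6, 0.7, 0.2, 0.5]
    print(activation(w, v, theta=1))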

A comparison of the activation function of a BSN and that of a standard sigmoid neuron is shown in Figure 1. The BSN has ten positive input bit streams with threshold θ = 5. In order to view the output in a three-dimensional space, the inputs are divided into two groups of five and the inputs in the same group are assumed to have equal values. In a stochastic bit stream neuron an important restriction is that the weight values must lie between -1 and 1. Compared with a standard sigmoid neuron, which allows arbitrary weight values, the limitation of the weight values in a BSN reduces the range of inputs that can be effectively discriminated by a single neuron. The limitation of the weight values in a BSN is different from that in a standard sigmoid neuron. The output of a BSN can be any value in the range [0, 1] while all the weights are limited to the range [-1, 1]; but limiting the weight values in a standard sigmoid neuron changes the range of the output value. Another interesting effect of the limitation of the weight values is that it forces the representation of the information to be distributed over the whole network during the learning process. In a multilayer perceptron, overlarge weights will usually destroy the learning. Weight decay is an efficient technique to avoid the weights becoming too large. However, large weights may be necessary in some cases. This restriction of the functionality of a single bit stream neuron can be overcome by increasing the degree of parallelism.

FIGURE 1. Outputs for Bit Stream Neuron and general sigmoid neuron. (Surface plots of Output against Input Value 1 and Input Value 2 for the standard sigmoid neuron and the bit stream neuron.)

2.3. Approximation by Gaussian Distributions

The activation function of a BSN based on the moment generating function description is not very convenient to calculate when the number of its inputs is large. It is necessary to simplify the calculation by introducing a reasonable approximation. Because the total net input contribution is the sum of independent contributions from the weighted inputs (on the condition that each input is generated independently), when the number of inputs is large the distribution of the sum can be estimated by a Gaussian distribution. Therefore we may describe the activation function as a cumulative Gaussian distribution function. The following theorem provides a more efficient way to calculate the activation function for a BSN with a large number of inputs.

Theorem: For an AND-BSN B(W, V, φ(θ)) with a large number of inputs n, where W = (w_1, w_2, ..., w_n) is the weight vector, V = (v_1, v_2, ..., v_n) is the corresponding input vector and φ(θ) is the threshold distribution function, if

\lim_{n \to \infty} \sum_{i=1}^{n} \left( |w_i v_i| - w_i^2 v_i^2 \right) = \infty

and |w_i| \le 1, |v_i| \le 1 for i = 1, 2, ..., n, then the activation function of the neuron B(W, V, φ(θ)) can be estimated by the function

\sum_{\theta=-n}^{n} \varphi(\theta) \int_{\theta}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx

where \mu = \sum_{i=1}^{n} w_i v_i and \sigma = \sqrt{\sum_{i=1}^{n} (|w_i v_i| - w_i^2 v_i^2)}.

The proof of the theorem is achieved by using the Central Limit Theorem. If the condition \lim_{n \to \infty} \sum_{i=1}^{n} (|w_i v_i| - w_i^2 v_i^2) = \infty is satisfied and all the elements in the vectors W and V are bounded by constants, the Lindeberg condition is then satisfied. Thus the standardised sum \sum_{i=1}^{n} (X_i - E(X_i)) / \sqrt{V(X)} is asymptotically normal. The details of the proof are omitted here, because they do not provide any significant insight. The approximate function fits the real one extremely well when the number n is large. It even works for a number n as small as 10. This Gaussian approximation gives us a more efficient way to train a large BSN network.
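The quality of the approximation is easy to probe numerically. The sketch below (illustrative code with invented names) evaluates the theorem's Gaussian estimate for a deterministic threshold and compares it with a Monte Carlo estimate obtained by sampling the signed contributions directly.

    import math, random

    def gaussian_activation(weights, inputs, theta):
        # Theorem's estimate: P(X >= theta) ~ 1 - Phi((theta - mu) / sigma).
        mu = sum(w * v for w, v in zip(weights, inputs))
        var = sum(abs(w * v) - (w * v) ** 2 for w, v in zip(weights, inputs))
        return 0.5 * math.erfc((theta - mu) / (math.sqrt(var) * math.sqrt(2)))

    def monte_carlo_activation(weights, inputs, theta, trials=200_000):
        # Reference value from sampling the X_i directly.
        hits = 0
        for _ in range(trials):
            total = 0
            for w, v in zip(weights, inputs):
                p, u = abs(w * v), random.random()
                total += 1 if u < (p + w * v) / 2 else (-1 if u < p else 0)
            hits += total >= theta
        return hits / trials

    random.seed(3)
    w = [random.uniform(-1, 1) for _ in range(20)]   # n = 20 inputs
    v = [random.random() for _ in range(20)]
    print(gaussian_activation(w, v, 2), monte_carlo_activation(w, v, 2))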

3. LEARNING IN BIT STREAM NEURONS

This section gives the main results of the paper, describing in turn how learning may be implemented at the different description levels presented in the last section. To simplify the learning process we only consider the case in which all threshold distributions are concentrated on particular values. In other words, each neuron in the network has a deterministic threshold.

3.1. Derivative Calculation

Clearly, the crucial point for implementing stochastic gradient descent learning is the calculation of the derivative. We denote the activation function of a single neuron by P_o(p_1, p_2, ..., p_n, θ). For notational convenience, here p_i is equal to the weighted input w_i v_i, and θ is the threshold value. If the weights are limited to the range [-1, 1] and the inputs are all in the range [0, 1], |p_i| is the probability of a bit having value 1 (p_i ≥ 0) or -1 (p_i < 0) in the weighted input sequence X_i. To calculate the partial derivative ∂P_o(p_1, p_2, ..., p_n, θ)/∂p_j we rewrite the moment generating function, isolating the jth factor and writing the product of the remaining factors as \sum_{i=-(n-1)}^{n-1} h_i e^{it}:

M_X(t) = \prod_{i=1}^{n} \left( \frac{|p_i| - p_i}{2} e^{-t} + (1 - |p_i|) + \frac{|p_i| + p_i}{2} e^{t} \right)

= \left( \frac{|p_j| - p_j}{2} e^{-t} + (1 - |p_j|) + \frac{|p_j| + p_j}{2} e^{t} \right) \prod_{i=1, i \neq j}^{n} \left( \frac{|p_i| - p_i}{2} e^{-t} + (1 - |p_i|) + \frac{|p_i| + p_i}{2} e^{t} \right)

= \left( \frac{|p_j| - p_j}{2} e^{-t} + (1 - |p_j|) + \frac{|p_j| + p_j}{2} e^{t} \right) \sum_{i=-(n-1)}^{n-1} h_i e^{it}

= \sum_{i=-(n-1)}^{n-1} \left( \frac{|p_j| - p_j}{2} h_i e^{(i-1)t} + (1 - |p_j|) h_i e^{it} + \frac{|p_j| + p_j}{2} h_i e^{(i+1)t} \right)

The activation function is given by

P_o(p_1, p_2, ..., p_n, \theta) = \frac{|p_j| + p_j}{2} h_{\theta-1} + (1 - |p_j|) h_{\theta} + \frac{|p_j| + p_j}{2} h_{\theta} + \sum_{i=\theta+1}^{n-1} \left( \frac{|p_j| - p_j}{2} h_i + (1 - |p_j|) h_i + \frac{|p_j| + p_j}{2} h_i \right)

= \frac{|p_j| + p_j}{2} h_{\theta-1} + \left( 1 + \frac{p_j - |p_j|}{2} \right) h_{\theta} + \sum_{i=\theta+1}^{n-1} h_i

Obviously there is no p_j in the coefficients h_i. We can take the partial derivative

\frac{\partial P_o(p_1, p_2, ..., p_n, \theta)}{\partial p_j} = \begin{cases} h_{\theta-1} & \text{if } p_j > 0 \\ h_{\theta} & \text{if } p_j < 0 \end{cases}

The coefficient h_k is simply the probability that the sum of all weighted inputs except the jth is equal to k. So we get

\frac{\partial P_o(p_1, p_2, ..., p_n, \theta)}{\partial p_j} = \begin{cases} \Pr\left( \sum_{i=1, i \neq j}^{n} X_i = \theta - 1 \right) & \text{if } p_j > 0 \\ \Pr\left( \sum_{i=1, i \neq j}^{n} X_i = \theta \right) & \text{if } p_j < 0 \end{cases}

The calculation of the above partial derivative can be easily implemented in hardware (van Daalen et al., 1990). Therefore stochastic gradient descent learning in BSN networks has a very exciting potential for on-chip learning.
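The result can be checked at the probability level with a few lines of code. The sketch below (illustrative, not the hardware implementation) computes the distribution of the sum of the other weighted inputs by convolution, reads off the probability at θ - 1 or θ, and compares it with a finite-difference estimate of ∂P_o/∂p_j.

    def convolve(ps):
        # P(sum of the signed weighted inputs = k); ps[i] is p_i = w_i * v_i.
        dist = {0: 1.0}
        for p in ps:
            step = {-1: (abs(p) - p) / 2, 0: 1 - abs(p), 1: (abs(p) + p) / 2}
            new = {}
            for k, pk in dist.items():
                for d, pd in step.items():
                    new[k + d] = new.get(k + d, 0.0) + pk * pd
            dist = new
        return dist

    def Po(ps, theta):
        return sum(pr for k, pr in convolve(ps).items() if k >= theta)

    ps = [0.48, -0.35, 0.06, 0.45, -0.18]        # weighted inputs p_i
    theta, j = 1, 0
    others = convolve(ps[:j] + ps[j + 1:])       # sum of all inputs except the j-th
    analytic = others.get(theta - 1 if ps[j] > 0 else theta, 0.0)
    eps = 1e-6                                   # finite-difference check
    numeric = (Po(ps[:j] + [ps[j] + eps] + ps[j + 1:], theta) - Po(ps, theta)) / eps
    print(analytic, numeric)                     # the two values should agree closely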

3.2. On Chip Learning

The results presented above demonstrate that the derivative of a neuron's output with respect to a particular weighted input j is simply the probability that the sum of all weighted inputs except the jth is equal to θ - 1 (for a positive weighted input j) or θ (for a negative weighted input j). Thus parallel hardware implementation of the calculation of derivatives can be done by monitoring the total net input to the neuron (or the counter). When the total is exactly equal to θ - 1, the bit streams representing the derivatives for the positive weighted inputs with current state `0' and for the negative weighted inputs with current state `-1' give an output 1. When the total is exactly equal to θ, the bit streams representing the derivatives for the positive weighted inputs with current state `+1' and for the negative weighted inputs with current state `0' give an output 1.

The easy solution for derivative calculation raises the question of whether a full backpropagation functionality for a network of neurons can also be realised in the hardware. Surprisingly the answer to this question is yes, though with some restrictions placed on the network architecture. To indicate how this can be done, we consider the analysis of back-propagation presented by Werbos (1993). His dynamic feedback approach views the backpropagation phase as calculating the derivative of the error (or appropriate final value) with respect to the corresponding parameter of the feedforward calculation. In our case the value needed at neuron j will be denoted by

\delta_j = -\frac{\partial E}{\partial O_j}

where E is the output error and O_j the output of neuron j. The value of \delta_j for a non-output neuron is given by

\delta_j = \sum_{i > j} w_{ij} \delta_i \frac{\partial O_i}{\partial p_{ij}}

where w_{ij} is the weight connecting neuron j to neuron i, and ∂O_i/∂p_{ij} is the derivative of the output of neuron i with respect to its net input p_{ij} = w_{ij} O_j from neuron j. Note that the representation used for the back-propagation values must be signed. The multiplication of the corresponding values can be performed by an AND gate. The summation is simply a linear neuron, which as indicated in van Daalen et al. (1990) can be implemented by choosing the threshold values uniformly at random in the range n to -n for an n-input neuron. The neuron gives the weighted sum required but multiplied by a factor 1/(2n + 1). Hence, in order for the back-propagation values to be correctly summed, there must be a regular connectivity between the layers. This will result in the δ's of earlier layers being reduced by a factor which is constant for the particular layer. The weights are updated by an amount

\Delta w_{ij} = \alpha \delta_j \frac{\partial O_j}{\partial p_{ij}} O_i

where α is the learning rate. By adjusting the learning rate the reduction of the δ's in earlier layers can be compensated for. The actual update would be made by allowing each bit of the stream for \Delta w_{ij} to adjust the least significant bit of the corresponding binary weight value. The value would be increased if a 1 bit was received and reduced otherwise. The update rate could also be scaled by adjusting more significant bits in the weight value.
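A software sketch of this monitoring scheme is given below (illustrative Python; the signal names are invented). Each operational cycle emits, alongside the neuron's output bit, one derivative bit per weighted input according to the rules above; averaged over many cycles, these bit streams converge to the derivatives ∂P_o/∂p_j of Section 3.1.

    import random

    def cycle_with_derivatives(weights, inputs, theta):
        # One AND-BSN cycle that also produces a derivative bit per weighted input.
        contribs = []
        for w, v in zip(weights, inputs):
            bit = (random.random() < abs(w)) and (random.random() < v)
            contribs.append((1 if w >= 0 else -1) * int(bit))   # current state of input
        total = sum(contribs)
        out = int(total >= theta)
        deriv_bits = []
        for w, c in zip(weights, contribs):
            if w >= 0:   # positive weighted input
                fire = (total == theta - 1 and c == 0) or (total == theta and c == 1)
            else:        # negative weighted input
                fire = (total == theta - 1 and c == -1) or (total == theta and c == 0)
            deriv_bits.append(int(fire))
        return out, deriv_bits

    random.seed(4)
    w = [0.8, -0.5, 0.3, 0.9]
    v = [0.6, 0.7, 0.2, 0.5]
    theta, cycles = 1, 200_000
    sums = [0] * len(w)
    for _ in range(cycles):
        _, d = cycle_with_derivatives(w, v, theta)
        sums = [s + b for s, b in zip(sums, d)]
    print([s / cycles for s in sums])   # estimates of dPo/dp_j

The weight update itself would then be driven by such bit streams, with each received bit nudging the least significant bits of the stored binary weight value as described above.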

3.3. Moment Generating Function Simulation

In this section we investigate the learning algorithm for the feedforward BSN network at the probability level. Unlike the general sigmoid activation function used in a multilayer backpropagation network, the activation function of a BSN is not simply a function of the sum of all its weighted inputs. This brings some difficulties in designing a learning algorithm: we need to consider the effect of each weight separately. According to the generalized delta rule, the change of a particular weight should be proportional to the contribution of that weight to the total error E:

\Delta w_{ji} = -\alpha \frac{\partial E}{\partial w_{ji}}

where α is the learning rate. By the chain rule

\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial O_j} \frac{\partial O_j}{\partial p_{ji}} \frac{\partial p_{ji}}{\partial w_{ji}}

where O_j is the output of the jth neuron, and p_{ji} = w_{ji} O_i is the weighted input from neuron i. Let the `credit' for the jth neuron be \delta_j = -\partial E / \partial O_j; then

\Delta w_{ji} = \alpha \delta_j \frac{\partial O_j}{\partial p_{ji}} \frac{\partial p_{ji}}{\partial w_{ji}} = \alpha \delta_j \frac{\partial O_j}{\partial p_{ji}} O_i

The partial derivative ∂O_j/∂p_{ji} of the output of the neuron with respect to its weighted input can be calculated using the method shown in Section 3.1. The `credit' δ_j is the value used in the back propagation phase. If the jth neuron is an output unit and E is defined as a mean square error function

E = \frac{(T_j - O_j)^2}{2}

where T_j is the target value for the jth neuron, then

\delta_j = -\frac{\partial E}{\partial O_j} = T_j - O_j

If the jth neuron is a hidden unit then

\delta_j = -\sum_{k} \frac{\partial E}{\partial O_k} \frac{\partial O_k}{\partial p_{kj}} w_{kj} = \sum_{k} \delta_k \frac{\partial O_k}{\partial p_{kj}} w_{kj}

where k ranges over the indices of the neurons to which the jth neuron connects. Again the partial derivative ∂O_k/∂p_{kj} can be calculated according to the proposed method.
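To make the probability-level learning rule concrete, the sketch below (illustrative code; the toy task and all names are invented) trains a single AND-BSN output unit by gradient descent, computing P_o exactly by convolution and the derivative from the h coefficients as in Section 3.1, with the weights clipped to [-1, 1]. A multilayer version would additionally propagate the δ_k to hidden units using the formula above.

    import random

    def convolve(ps, skip=None):
        # pmf of the sum of the signed weighted inputs, optionally skipping one index.
        dist = {0: 1.0}
        for i, p in enumerate(ps):
            if i == skip:
                continue
            step = {-1: (abs(p) - p) / 2, 0: 1 - abs(p), 1: (abs(p) + p) / 2}
            new = {}
            for k, pk in dist.items():
                for d, pd in step.items():
                    new[k + d] = new.get(k + d, 0.0) + pk * pd
            dist = new
        return dist

    def forward(w, v, theta):
        ps = [wi * vi for wi, vi in zip(w, v)]
        return sum(pr for k, pr in convolve(ps).items() if k >= theta)

    def train_single_bsn(patterns, theta=1, lr=0.5, epochs=300):
        n = len(patterns[0][0])
        w = [random.uniform(0.0, 0.2) for _ in range(n)]
        for _ in range(epochs):
            for v, target in patterns:
                ps = [wi * vi for wi, vi in zip(w, v)]
                out = sum(pr for k, pr in convolve(ps).items() if k >= theta)
                delta = target - out                     # credit of the output unit
                for j, (wj, vj) in enumerate(zip(w, v)):
                    h = convolve(ps, skip=j)             # distribution of the others
                    dPo_dpj = h.get(theta - 1 if wj >= 0 else theta, 0.0)
                    w[j] = min(1.0, max(-1.0, wj + lr * delta * dPo_dpj * vj))
        return w

    random.seed(5)
    patterns = [([1, 0, 0], 0.9), ([0, 1, 0], 0.9), ([0, 0, 1], 0.1), ([0, 0, 0], 0.0)]
    w = train_single_bsn(patterns)
    print([round(x, 2) for x in w],
          [round(forward(w, v, 1), 2) for v, _ in patterns])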

3.4. Simulations using the Gaussian Approximation

We have proposed a Gaussian approximation for the BSN activation function when the number of its inputs is large. Now we derive the learning method for a large BSN network using this Gaussian approximation. For a BSN with a large number of inputs, if the threshold distribution is concentrated on a value θ, its output is approximately

P_o(p_1, p_2, ..., p_n, \theta) \approx \int_{\theta}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx

where the mean \mu = \sum_{i=1}^{n} p_i and the standard deviation \sigma = \sqrt{\sum_{i=1}^{n} (|p_i| - p_i^2)}. If we use the cumulative Gaussian distribution function as an approximate BSN activation function, the derivative used for gradient descent learning should also be derived from it. According to the chain rule we have

\frac{\partial P_o}{\partial p_i} = \frac{\partial P_o}{\partial \mu} \frac{\partial \mu}{\partial p_i} + \frac{\partial P_o}{\partial \sigma} \frac{\partial \sigma}{\partial p_i}

= \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(\theta-\mu)^2}{2\sigma^2}} + \frac{\theta - \mu}{\sqrt{2\pi}\,\sigma^2} e^{-\frac{(\theta-\mu)^2}{2\sigma^2}} \cdot \frac{1}{2\sigma} \left( \frac{\partial |p_i|}{\partial p_i} - 2 p_i \right)

= \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(\theta-\mu)^2}{2\sigma^2}} \left( \frac{\theta - \mu}{2\sigma^2} \left( \frac{\partial |p_i|}{\partial p_i} - 2 p_i \right) + 1 \right)

A gradient descent learning procedure similar to the one presented in the last section can be applied to a large BSN network using the Gaussian approximation. The simulation with the Gaussian approximation is much more efficient than the one using the moment generating function description for a large network. However, it is very time consuming to calculate the cumulative Gaussian distribution function and its derivative. To further accelerate the learning process, two look-up tables have been set up to evaluate the following two functions:

f_1(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{u^2}{2}} du, \qquad f_2(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}

In general, if the number of inputs to each neuron in the network is larger than 15, the simulation results with the Gaussian approximation can be directly applied to the corresponding network using the moment generating function description. Only in a few cases does the network need to be retrained based on the results obtained in the simulation with the Gaussian approximation. The retraining process is fairly easy because only a few neurons fail to reach the expected values and the differences between their outputs and the expected values are usually quite small.
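The look-up-table scheme can be sketched as follows (illustrative code; the table resolution and range are assumptions, not values from the paper). The tables store f_1 and f_2 on a uniform grid, the derivative is assembled exactly as in the chain-rule expression above, and a finite-difference check against an exact erfc-based activation is included.

    import math

    TABLE_SIZE, X_MAX = 2001, 5.0          # assumed table resolution and range
    _xs = [-X_MAX + 2 * X_MAX * k / (TABLE_SIZE - 1) for k in range(TABLE_SIZE)]
    F1 = [0.5 * math.erfc(-x / math.sqrt(2)) for x in _xs]            # f1: Gaussian cdf
    F2 = [math.exp(-x * x / 2) / math.sqrt(2 * math.pi) for x in _xs] # f2: Gaussian pdf

    def lookup(table, x):
        # nearest-entry lookup, clamped to the tabulated range
        x = max(-X_MAX, min(X_MAX, x))
        return table[int(round((x + X_MAX) * (TABLE_SIZE - 1) / (2 * X_MAX)))]

    def moments(ps):
        mu = sum(ps)
        sigma = math.sqrt(sum(abs(p) - p * p for p in ps))
        return mu, sigma

    def gauss_activation(ps, theta):
        mu, sigma = moments(ps)
        return 1.0 - lookup(F1, (theta - mu) / sigma)

    def gauss_derivative(ps, theta, i):
        # chain-rule derivative dPo/dp_i from Section 3.4
        mu, sigma = moments(ps)
        z = (theta - mu) / sigma
        dabs = 1.0 if ps[i] >= 0 else -1.0            # d|p_i|/dp_i
        return lookup(F2, z) / sigma * (z / (2 * sigma) * (dabs - 2 * ps[i]) + 1.0)

    def exact_activation(ps, theta):
        mu, sigma = moments(ps)
        return 0.5 * math.erfc((theta - mu) / (sigma * math.sqrt(2)))

    ps = [0.4, -0.3, 0.2, 0.5, -0.1, 0.35]
    theta, i, eps = 1, 0, 1e-5
    fd = (exact_activation(ps[:i] + [ps[i] + eps] + ps[i + 1:], theta)
          - exact_activation(ps[:i] + [ps[i] - eps] + ps[i + 1:], theta)) / (2 * eps)
    print(gauss_activation(ps, theta), exact_activation(ps, theta))
    print(gauss_derivative(ps, theta, i), fd)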

3.5. Features of the BSN Networks

At the probability level the BSN networks are very similar to multilayer backpropagation networks with local `temperatures', i.e. each neuron in the network has a particular temperature value. The standard deviation σ in a BSN corresponds to the temperature in a general sigmoid neuron. Its value is determined by the weighted inputs to the neuron. This brings the network a very interesting feature: medium-size weights are automatically penalized during learning and all the weights tend to approach -1, 1 or 0. This feature has been confirmed by the simulation results on several small problems (PARITY and ENCODE). Another feature of a BSN network is that its output surface is usually quite flat (Figure 2). A large area is around probability one half, which means the outputs of the network for many inputs are `uncertain'. This is caused by the limitation of the weight values in the network. We expect this feature to give the network good generalisation performance.

FIGURE 2. Output surfaces for the XOR problem obtained by a multilayer perceptron and a BSN network. (Panels: "XOR problem solved by MLP" and "XOR problem solved by BSNN".)


4. SIMULATION RESULTS

4.1. Monk's Problem

The MONK's problems, which were the basis of a first international comparison of learning algorithms, are derived from a domain in which each training example is represented by six discrete-valued attributes. Each problem involves learning a binary function defined over this domain from a sample of training examples of this function. The `true' concepts underlying the three MONK's problems are:

MONK-1: (attribute_1 = attribute_2) or (attribute_5 = 1)
MONK-2: (attribute_i = 1) for exactly two i ∈ {1, 2, ..., 6}
MONK-3: (attribute_5 = 3 and attribute_4 = 1) or (attribute_5 ≠ 4 and attribute_2 ≠ 3)

There are 124, 169 and 122 samples in the training sets of MONK-1, MONK-2 and MONK-3 respectively. The testing set has 432 patterns. The network used 17 input units, 10 hidden units and 1 output unit, and was fully connected. Experiments were performed with and without noise in the training examples: there is 5% additional noise (misclassifications) in the training set of MONK-3. The results for the MONK's problems using the moment generating function simulation are as follows:

          BP Network             BSN Network
          training   testing     training   testing
MONK-1    100%       86.6%       100%       97.7%
MONK-2    100%       84.2%       100%       100%
MONK-3    97.1%      83.3%       98.4%      98.6%

It can be seen that the generalisation of the BSN network is much better than that of a general multilayer backpropagation network. The results on the MONK-3 problem are extremely good: the result reported by Hassibi & Stork (1993) using a sophisticated weight pruning technique is only 93.4% correct for the training set and 97.2% correct for the testing set.
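For reference, the full MONK's domain and the targets can be generated in a few lines. The sketch below is illustrative only; it assumes the 17 input units correspond to a standard one-of-N coding of the six attributes, whose value counts are 3, 3, 2, 3, 4 and 2, and it builds the 432-pattern domain with the MONK-1 labels.

    from itertools import product

    SIZES = [3, 3, 2, 3, 4, 2]          # value counts of attributes a1..a6

    def monk1_label(a):
        # MONK-1 concept: (attribute1 = attribute2) or (attribute5 = 1)
        return int(a[0] == a[1] or a[4] == 1)

    def one_hot(a):
        # assumed 17-bit one-of-N coding of the six attribute values
        bits = []
        for value, size in zip(a, SIZES):
            bits.extend(int(value == k) for k in range(1, size + 1))
        return bits

    domain = list(product(*[range(1, s + 1) for s in SIZES]))
    data = [(one_hot(a), monk1_label(a)) for a in domain]
    print(len(domain), len(data[0][0]))   # 432 patterns, 17 input bits each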

4.2. Handwritten Digit Recognition

A real handwritten digit recognition task was chosen to test the learning and the generalisation of BSN networks. The database consisted of a clean set of 500 16×16 pixel samples; 400 samples were used for training and the remaining 100 samples were used for testing. Some typical ones are shown in Figure 3. The fully connected feedforward BSN network had 256 input units, 24 hidden units and 10 output units. There were 6418 connections in total. The simulation was carried out at the probability level and the criterion used for convergence and testing is that the sum of the absolute errors of the output units is less than 0.5. The network was successfully trained on the 400 input patterns with a rate of 99.3% correct. The trained BSN network had a very good generalisation performance: 90.0% correct on the testing set. The generalisation performance of the BSN network with different numbers of training samples is compared with that of a general feedforward multilayer backpropagation network of the same structure. The simulation result is shown in Figure 4.

FIGURE 3. Some samples of the handwritten digits.

FIGURE 4. Generalisation of the BSN network and BP network for handwritten digit recognition. (Generalisation plotted against the number of training samples, 0 to 400, for the BSNN and the BPN.)

5. CONCLUSIONS

This paper has presented three levels of description at which learning algorithms can be applied to networks of BSNs. At the hardware level a demonstration of derivative calculation has shown that full backpropagation could be implemented. At an intermediate level of description, using moment generating functions, successful training has been reported on standard test problems. In these cases the solutions showed pleasing properties, giving exact outputs rather than approximations. Finally, a third level of description, using Gaussian approximations to the functionality of the neurons, was used to train large scale networks to solve a real problem involving handwritten character recognition, with a training set of 400 characters and a further 100 used to estimate generalisation performance. The paper has supported the claim that networks of bit stream neurons have comparable functionality to networks of standard sigmoid neurons, while training these networks is feasible for large scale simulations. At the same time there is the prospect of implementing backpropagation in the neural (digital) hardware.

Acknowledgements

This research was supported by the Sino-British Friendship Scholarship Scheme. We thank Jeffrey Wood for many helpful discussions.

REFERENCES

van Daalen, M., Jeavons, P. & Shawe-Taylor, J. (1990). Probabilistic Bit Stream Neural Chip: Implementation. International Workshop on VLSI for AI and NN.

van Daalen, M., Jeavons, P. & Shawe-Taylor, J. (1993). A stochastic neural architecture that exploits dynamically reconfigurable FPGAs. IEEE Workshop on FPGAs for Custom Computing Machines, (pp. 202-211).

Darken, C. & Moody, J. E. (1992). Towards Faster Stochastic Gradient Search. Advances in Neural Information Processing Systems 4, (pp. 1009-1016).

Gains, B. R. (1969). Stochastic Computing Systems. Advances in Information Systems Science, New York: Plenum.

Hassibi, B. & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. Advances in Neural Information Processing Systems 5, (pp. 164-171).

Hertz, J., Krogh, A. & Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company.

Jeavons, P., Cohen, D. & Shawe-Taylor, J. (1994). Generating Binary Sequences for Stochastic Computing. IEEE Transactions on Information Theory, 40(3), 716-721.

Kondo, Y. & Sawada, Y. (1992). Functional Abilities of a Stochastic Logic Neural Network. IEEE Transactions on Neural Networks, 3, 434-443.

Murray, A. & Tarassenko, L. (1994). Analogue Neural VLSI: A Pulse Stream Approach. Chapman & Hall.

Rumelhart, D. E. & McClelland, J. L. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition (vol. 1). Cambridge, MA: The MIT Press.

Shawe-Taylor, J., Jeavons, P. & van Daalen, M. (1991). Probabilistic Bit Stream Neural Chip: Theory. Connection Science, 3(3), 317-328.

Tomberg, J. E. & Kaski, K. K. K. (1990). Pulse-density modulation technique in VLSI implementations of neural network algorithms. IEEE Journal of Solid-State Circuits, 25, 1277-1286.

Werbos, P. J. (1993). The Roots of Backpropagation. John Wiley and Sons.

Zhao, J. & Shawe-Taylor, J. (1995). A Recurrent Network with Stochastic Weights. (Submitted to IEEE Transactions on Neural Networks.)