CHARACTERISTICS OF MULTI-LAYER PERCEPTRON MODELS IN ENHANCING DEGRADED SPEECH

T.T. Le†, J.S. Mason† & T. Kitamura‡

† Dept. of Electrical & Electronic Eng., University College of Swansea, SWANSEA, SA2 8PP, UK
‡ Dept. of Electrical & Computer Eng., Nagoya Institute of Technology, NAGOYA, 466, Japan

ABSTRACT

A multi-layer perceptron (MLP) acting directly in the time domain is applied as a speech signal enhancer, and its performance is examined in the context of three common classes of degradation, namely non-linear system degradation (introduced by a low-bit-rate CELP coder), additive Gaussian white noise, and convolution by a linear system. The investigation focuses on two topics: (i) net topology, comparing single- and multiple-output structures, and (ii) the influence of non-linearities within the net. Experimental results confirm the importance of matching the enhancer to the class of degradation. In the case of the CELP coder the standard MLP, with its inherently non-linear characteristics, is consistently better than any equivalent linear structure. In contrast, when the degradation is from additive noise, a linear enhancer is always superior. Interestingly, in both cases nets with multiple outputs give significantly better performance than single-output structures.

1 Introduction

The non-linear enhancer used in this work is a standard multi-layer perceptron (MLP) structure configured as a time-domain filter. In this respect the work can be compared with that of [1], [3], and [7], all of whom use similar time-domain structures. However, there would seem to be a lack of justification, both theoretical and experimental, for the use of such structures over, for example, simple linear filters. In fact this paper shows, through a series of direct experimental comparisons, that only in the case of non-linear degradation is the non-linear enhancer likely to be superior.

The enhancer's performance is examined for speech degraded in different ways. Three standard forms of signal degradation are considered, namely non-linear degradation introduced by CELP, additive Gaussian white noise (GWN), and linear system degradation. We begin with the CELP degradation since this is likely to present the sternest challenge. The error here is known to be correlated with the speech signal, making the performance of any enhancer text-dependent.

1.1 Objectives

The primary goals of this paper are to:
1. identify the most appropriate structures for speech enhancement;
2. demonstrate the benefits, if any, of non-linearity in the enhancer;

Figure 1: [top] Original speech signal, s_i, [middle] CELP generated speech, d_i, and [bottom] corresponding error signal, e_i = s_i - d_i. Clearly correlation exists between the original and the error.

This is clear from Figure 1, which shows an example of CELP generated speech [middle], the original speech [top], and the corresponding error signal; in practice the error signal is itself found to be intelligible. The correlation means that the performance of any enhancer will be text-dependent [4]. As a consequence the experiments presented in this paper relate to a restricted speech database, namely /e/ utterances from 40 speakers. It is postulated that in a practical application multiple structures would be needed, one for each class of speech.
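To illustrate the distinction drawn above, the following sketch contrasts a coder-like error that contains a signal-dependent component with an independent additive-noise error, using the Pearson correlation between signal and error. The signals and coefficients are illustrative stand-ins, not the paper's data.

```python
import numpy as np

# Illustrative sketch (not the paper's data): a coder whose error is
# correlated with the input, versus additive white noise, compared by
# the Pearson correlation between signal and error.
rng = np.random.default_rng(0)
t = np.arange(8000)
s = np.sin(2 * np.pi * 200 * t / 8000)          # toy "clean" vowel-like tone

# CELP-like case: error contains a signal-dependent component
e_coder = 0.3 * s + 0.1 * rng.standard_normal(t.size)
# GWN case: error is independent of the signal
e_noise = 0.3 * rng.standard_normal(t.size)

def corr(a, b):
    """Pearson correlation coefficient between two equal-length signals."""
    return float(np.corrcoef(a, b)[0, 1])

print(f"coder error vs signal: {corr(s, e_coder):.2f}")   # strongly correlated
print(f"noise error vs signal: {corr(s, e_noise):.2f}")   # near zero
```

A signal-correlated error is exactly what makes the enhancement task text-dependent: what the enhancer learns about the error depends on the speech material it was trained on.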

2 MLP as a Non-linear Predictor

The application of an MLP configured as a time-domain filter was proposed by Tamura and Waibel [7], and Kaouri and Lin [3]. The approach can be viewed as mapping a set of noisy signals to a set of signals with reduced noise. In this section, we consider three fundamental aspects of MLPs, namely the non-linearity, the net size, and the net topology.

2.1 Non-linearity

Non-linearity in MLPs is normally in the form of the activation function f(x), which should be differentiable and monotonically increasing to accommodate gradient and error back-propagation techniques [5, 6]. Many activation functions have been proposed [2]. Examples include the sigmoid 1/(1 + exp(-x)), the bipolar sigmoid 2/(1 + exp(-x)) - 1, the squash x/(1 + |x|) and the error function erf(x). Authors such as [1, 3, 7] use the standard sigmoid function and claim good performance in their speech enhancement application. However, a comparison with linear filter structures in the context of different degradations seems not to have been reported. Thus one question we pose is whether or not the non-linearity is important. Consequently both linear and non-linear functions are considered in otherwise identical structures. In the non-linear case, a bipolar sigmoid is chosen since it more readily matches the bipolar speech input. In the linear case the sigmoid activation function is replaced by a simple linear term, i.e. a constant slope. Of course in this case there is no theoretical justification for multiple layers, and therefore we begin by examining single layer structures, with and without non-linearities. In all cases, linear and non-linear, the structures are trained using the standard error back-propagation, conjugate gradient, batch update algorithm. This means that minimal changes are required when switching from a non-linear to a linear structure: the sigmoid is simply replaced by a straight-line function of unity slope. The learning process is terminated by using cross-validation so as to limit the effects of possible over-training [2].

2.2 Net Size and Topology

Two topologies have been used throughout this work (Figure 2). The first is similar to a conventional linear filtering structure in that a block of input samples is used to give a single filtered output sample. In other words, while the number of elements in both input and hidden layers are variable, the number of output elements is always one. The second model maps a block of input samples to a block of filtered output samples of equal size. With this model, the number of elements in all layers is variable, but the number of elements in the input layer and the number in the output layer are kept identical.

Figure 2: Single and multiple output net structures

In order to identify an appropriate net size we start with a minimum sensible size and then increase the number of nodes, with the restrictions mentioned above, to reveal a trend in performance.
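A minimal sketch of the activation functions named in Section 2.1, including the two actually compared in this work: the bipolar sigmoid (output in (-1, 1), matching bipolar speech samples) and the unity-slope straight line that reduces the net to a linear structure.

```python
import numpy as np

# Activation functions from Section 2.1. Only bipolar_sigmoid and linear
# are used in the experiments; the others are listed for comparison.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid(x):
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def squash(x):
    return x / (1.0 + np.abs(x))

def linear(x):
    return x  # unity-slope straight line: reduces the net to a linear filter

x = np.array([-2.0, 0.0, 2.0])
print(bipolar_sigmoid(x))  # antisymmetric about 0, unlike the standard sigmoid
```

The antisymmetry of the bipolar sigmoid, f(-x) = -f(x), is the property that matches the bipolar nature of speech samples.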

3 Performance Assessment

Ideally the task of the enhancer is to improve perceptual aspects of the speech signal. However, subjective assessment is very difficult and so throughout this work we resort to the commonly used objective measure of segmental signal-to-noise ratio, SNR_seg, acknowledging the limitations thereof. SNR_seg is defined as:

$$ \mathrm{SNR}_{seg} = \frac{1}{N} \sum_{f=1}^{N} \mathrm{SNR}_f $$

where N is the number of segments (or frames) and SNR_f is the signal-to-noise ratio of frame f:

$$ \mathrm{SNR}_f = 10 \log_{10} \frac{\sum_{i=1}^{T} s_i^2}{\sum_{i=1}^{T} (s_i - d_i)^2} $$

where s_i is the original `clean' speech, d_i is the degraded speech signal, and T is the number of speech samples per frame.
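As a check on the definitions above, a minimal numpy sketch of the segmental SNR measure; the frame length and test signals here are illustrative, not those of the experiments.

```python
import numpy as np

# Segmental SNR: frame-wise SNR in dB, averaged over frames.
def snr_seg(s, d, T=256):
    """SNR_seg between clean speech s and degraded speech d, frame length T."""
    n_frames = len(s) // T
    snrs = []
    for f in range(n_frames):
        sf = s[f * T:(f + 1) * T]
        df = d[f * T:(f + 1) * T]
        num = np.sum(sf ** 2)                # frame signal energy
        den = np.sum((sf - df) ** 2)         # frame error energy
        snrs.append(10.0 * np.log10(num / den))
    return float(np.mean(snrs))

rng = np.random.default_rng(1)
s = np.sin(2 * np.pi * np.arange(2048) / 32.0)
d = s + 0.1 * rng.standard_normal(s.size)    # mildly degraded copy
print(f"SNRseg = {snr_seg(s, d):.1f} dB")
```

Averaging frame-wise dB values, rather than computing one global SNR, prevents high-energy frames from dominating the measure.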

4 Degraded Speech

For the purpose of comparison, the three cases are arranged to have approximately equal SNR, governed by that of the chosen CELP scheme, the details of which are:
- excitation length 64, VQ codebook size 1024,
- LPC order 10, VQ codebook size 256,
- LPC window length 256,
- no pre-emphasis,
- no pitch prediction,
giving an SNR_seg of 7.4 dB. The resultant speech can be classed as synthetic or machine-like, with an observed loss of natural pitch variations. The other two classes of degradation are arranged to have similar SNRs to that of the CELP. In the case of GWN the SNR is simply adjusted to give 7.4 dB, and in the case of the linear system, H(z) = G(1 + z^{-n}), both G and n are again experimentally determined, n = 12 and G = 0.7, to give a similar SNR. In some respects this degradation may be regarded as somewhat arbitrary and unrealistic with its notch characteristics. Nonetheless it provides a stern task for the enhancer, with its associated very long inverse function.
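A short sketch of the linear degradation H(z) = G(1 + z^-n) with the stated values G = 0.7, n = 12, and of why its inverse is "very long": 1/H(z) expands to an alternating impulse train (1/G)(1 - z^-n + z^-2n - ...) that never decays, so no finite-input enhancer can realise it exactly. Function names here are illustrative.

```python
import numpy as np

G, n = 0.7, 12

def degrade(s, G=0.7, n=12):
    """Apply d[i] = G * (s[i] + s[i-n]), with zero initial conditions."""
    d = np.copy(s)
    d[n:] += s[:-n]
    return G * d

# Truncated impulse response of the exact inverse filter 1/H(z):
# h_inv[k] = (1/G) * (-1)^m for k = m*n, and 0 elsewhere.
L = 61
h_inv = np.zeros(L)
h_inv[::n] = (1.0 / G) * (-1.0) ** np.arange(len(h_inv[::n]))
print(h_inv[::n])  # alternating +/- 1/G taps every n samples, no decay
```

The comb/notch shape of H(z) (zeros on the unit circle at odd multiples of pi/n) is exactly what makes the inverse marginally stable and arbitrarily long.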

5 Experiments and Results

5.1 Single Layer

The first experiments relate to single layer structures and degradation with additive GWN. For the linear case (straight-line activation function), single layers are theoretically sufficient, and so these experiments set the benchmarks for such structures, to be met subsequently when considering nets with hidden layers. We test the four basic variants, however (linear, non-linear, single and multiple output), to establish the number of inputs likely to give good performance.


Figure 3: SNR_seg performance of linear and non-linear structures with a single layer; original SNR_seg 7.4 dB, additive GWN degradation.

The corresponding four profiles are shown in Figure 3. It is clear that in all four cases there is a `knee' at around 40 to 60 inputs, spanning 40 to 60 ms, and that the results for multiple-output structures are significantly better than those for single outputs, e.g. an improvement of 6.5 dB compared with just 4.3 dB when the number of inputs > 60. It is also clear that the linear systems out-perform the equivalent non-linear ones, at least in the context of GWN and single layer structures; the differences are small but consistent. This line of investigation is continued in the following section with net structures which include a hidden layer and the other two degradation classes.

5.2 Influences of Non-Linearity

We now consider structures with one `hidden' layer, in an attempt to realise the full potential of the MLP structures. Two sets of almost identical structures, with the sole difference that in one the MLP sigmoid is replaced by a unity-slope, straight-line function, are applied to the three classes of degradation in turn. This facilitates direct comparison, with minimal alterations to the structures, and the linear structures should give results equivalent to the single layer cases reported above. Results are shown in Figure 4.

In the case of degradation by the CELP coder, MLPs with their inherent non-linear characteristics consistently provide better performance than equivalent linear systems, Figure 4a; for example, for a net structure of 60-60-60 the SNR_seg achieved by the non-linear and linear MLP is 10.6 dB and 9.0 dB respectively, giving improvements of 3.2 dB and 1.6 dB.

In contrast, for degradation by additive GWN, Figure 4b, linear structures perform better than non-linear ones, as indeed they do in the case of single layer structures, Section 5.1. For example, from Figure 4b, with a net structure of 60-60-60 the non-linear and linear MLPs give improvements of 6.0 dB and 6.4 dB respectively.

There is less consistency in the case of degradation by the linear system. Figure 4c shows that in this case the number of net inputs is the over-riding parameter; the larger the number of inputs, the better the performance. This is to be expected given the nature of the degradation function: 0.7(1 + z^{-12}). The optimum inverse function has an exceedingly long impulse response.
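The "minimal alterations" comparison above can be sketched as a forward pass for a 60-60-60 net in which the activation function is the only point of difference between the linear and non-linear structures. The weights and initialisation below are random stand-ins, purely illustrative.

```python
import numpy as np

# Two otherwise-identical 60-60-60 structures: same weights, same topology,
# differing only in the activation applied at each layer.
rng = np.random.default_rng(2)
n_in, n_hid, n_out = 60, 60, 60
W1 = rng.standard_normal((n_hid, n_in)) * 0.1
W2 = rng.standard_normal((n_out, n_hid)) * 0.1

def bipolar_sigmoid(x):
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def forward(x, activation):
    """Two-layer forward pass; `activation` is the only point of difference."""
    return activation(W2 @ activation(W1 @ x))

x = rng.standard_normal(n_in)
y_nonlinear = forward(x, bipolar_sigmoid)
y_linear = forward(x, lambda v: v)   # unity-slope line: a linear filter
print(y_nonlinear.shape, y_linear.shape)
```

With the unity-slope activation the two layers collapse to the single linear map W2 @ W1, which is why single layer structures suffice as benchmarks for the linear case.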

5.3 Influence of Net Topology

Figure 4a and Figure 4b show that MLPs with multiple outputs (model 2) give consistently better performance than MLPs with a single output (model 1). This agrees with [3] and [7]. This finding is attributable to the nature of the output, in this case a time-domain speech signal, and is in contrast to the situation where the outputs represent independent classes in the more common MLP classification applications such as speech and speaker recognition. In the case of speech signal outputs, obvious correlations provide useful training information in model 2, unavailable in model 1.

Figure 4: Each pair of bars represents the SNR_seg performance of linear and non-linear nets of the same topology, for (a) CELP, (b) GWN, and (c) H(z) degradation; 40-10-1 means 40 nodes in the input layer, 10 in the hidden layer, and 1 output.

6 CONCLUSION

The primary findings of this experimental study are:
1. a fixed enhancer can improve low-bit-rate CELP speech quality;
2. in the case of additive GWN a linear structure (i.e. a straightforward linear filter) is preferable to an MLP; in contrast, for low-bit-rate CELP coder degradation an MLP gives better performance;
3. MLPs with multiple outputs give significantly better performance than structures with a single output, for CELP and GWN degradation.

The results presented highlight the importance of matching the enhancer to the characteristics of the degradation: linear to linear and non-linear to non-linear.

REFERENCES

[1] T. Fechner. Nonlinear noise filtering with neural networks: comparison with Wiener optimal filtering. Proc. IEE Conference on Artificial Neural Networks, May 1993.

[2] D. Hush and B. Horne. Progress in supervised neural networks. IEEE Signal Processing Magazine, pages 8-39, January 1993.

[3] H. A. Kaouri and M. L. Lin. Enhancement of coded speech signals using artificial neural network techniques. IEE Proc. 6th ICPSC, pages 230-234, 1988.

[4] T. T. Le. Speech enhancement with non-linear prediction. M.Phil. Thesis, University College Swansea, 1993.

[5] R. P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, pages 4-22, April 1987.

[6] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. Parallel Distributed Processing, volume 1. MIT Press, 1986.

[7] S. Tamura and A. Waibel. Noise reduction using connectionist models. Proc. ICASSP-88, 1:553-556, 1988.