Automated Speech Discrimination using Frequency Derivative Threshold Detection

Christopher J. Page and Syed Mahfuzul Aziz
School of Electrical and Information Engineering, University of South Australia
Mawson Lakes, SA 5095, Australia
E-mail: [email protected]; [email protected]

Abstract

Traditionally, the discrimination of speech from non-speech signals within audio has been performed using amplitude thresholds or energy levels within the audio signals. While these and other, newer methods have been shown to discriminate speech effectively, this paper takes a different perspective on speech discrimination. In general, the frequency of a person's voice varies within a limited range over small time periods. In contrast, many undesirable background noises tend to vary widely in frequency. Hence, by analysing the change in frequency over time, an effective means can be found for identifying what is human speech and what is noise. This paper uses this form of speech discrimination to develop an efficient method for isolating speech from non-speech signals.
1. Introduction

With modern speech recognition software able to accurately transcribe speech to text, such systems are becoming widespread in areas including the automated transcription of meetings. For example, the Australian Defence Science and Technology Organisation (DSTO) uses such a system, called the Automatic Transcriber of Meetings (AuTM), in its Intense Collaboration Space (ICS) facility [1]. AuTM uses a set of microphones around a room to record utterances spoken during a meeting and then transcribes these recordings as meeting minutes; the aim is for a meeting to be recorded as a wave file and then transcribed to text as meeting minutes.

Automated transcription systems similar to AuTM are likely to suffer from problems arising from a number of issues. The most notable ones are: the threshold detection problem (or speech discrimination problem), i.e. working out what is voice and what is noise; and the speaker separation problem when two speakers talk at the same time, also known as the co-channel speaker problem. A solution to the second problem, co-channel speaker separation, has been attempted by subjecting the recorded audio to a pre-processor employing an Adaptive Decorrelation Filter (ADF) [2]. The first problem, threshold detection, can be attempted using a variety of methods including amplitude levels, energy levels [3], and cepstral analysis [4]. Currently AuTM uses a system of measuring energy levels to discriminate against noise. When a person speaks, the energy within the voice is compared to a set threshold; if the threshold is met, the audio is deemed to be speech [5]. The main problem with this method is that background noise can often contain very high levels of energy, and hence false detections are made. Within the AuTM system a false detection both causes the audio to be placed in the queue to be transcribed, considerably adding to the load on the system, and causes the background noise to be stored as part of the meeting.

Many methods have been devised to overcome the shortcomings of measuring just the energy levels. These include wavelet transforms [6], Linear Discriminant Analysis [5], and combined frequency and cepstrum measurements [7]. However, most of these techniques require many steps, and as such are computationally intensive [6]. This paper aims at improving the performance of speech recognition systems through a simple pre-processing algorithm that quickly discriminates speech from non-speech signals. It examines the change in frequency over time to determine whether the signal is speech. Unlike the other methods stated earlier, this process requires less computation, and as such makes an effective first-stage pre-processor.

Another issue in speech recognition [8] is that of identifying the end points of speech in environments with a low Signal-to-Noise Ratio (SNR). The accurate identification of these points is necessary to achieve accurate transcription. As end point detection is a necessary part of a speech discrimination system, the method described in this paper is designed to work in conjunction with such a system.
2. The threshold detection algorithm

2.1. Variation in frequency

The human body produces sound when air from the lungs is forced up through the vocal tract, which vibrates to create what we hear as voiced speech [7]. If a small portion of the sound created (for example, over a 30 ms period) is analysed, the vocal tract can be treated as a linear time-invariant system [7]. Thus a Fourier transform can be used to move the signal into the frequency domain, and the frequency derivative can then be computed to measure variations in frequency over time. We carried out some simple tests by feeding pieces of audio files to a Matlab program and determining the changes in frequency over time. It was found that normal speech varied little in frequency, whereas unwanted background noises such as coughing, door slamming and paper rustling tended to vary greatly in frequency over time. Using this observation, an algorithm can be developed which measures the change in frequency levels over the length of an audio recording, to identify whether the audio is speech requiring transcription or just background noise to be discarded.
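As an illustration of this idea (not the authors' implementation), the following minimal Python/NumPy sketch compares the dominant frequency of two adjacent short slices of a signal; the sample rate and the synthetic samples are assumptions made purely for the example.

```python
# Minimal sketch: dominant frequency of two adjacent ~30 ms slices and the
# change between them. Sample rate and synthetic input are assumptions.
import numpy as np

fs = 11024                      # sample rate assumed from Section 2.2
frame_len = int(0.030 * fs)     # ~30 ms analysis slice

signal = np.random.randn(2 * frame_len)   # stand-in for real audio samples

def peak_frequency(frame, fs):
    """Return the frequency (Hz) with the largest magnitude in the frame."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

f1 = peak_frequency(signal[:frame_len], fs)
f2 = peak_frequency(signal[frame_len:2 * frame_len], fs)
print("change in dominant frequency: %.1f Hz" % abs(f2 - f1))
```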
2.2. The threshold detection process

From the description of speech presented earlier, it can be seen that by analysing an audio file and finding the changes in frequency over time, a signal can be identified as either voice or noise. To measure these changes in frequency over the entire length of a recording, the audio must first be split into small frames, with each frame being transformed into the frequency domain. Once in the frequency domain, the frequency with the highest amplitude can be found and stored in an array. Finally, the array of frequencies for the audio is analysed for changes over time, which gives an insight into what can be considered voiced speech or noise.

To facilitate the measurement of the frequency response, a window is applied to each frame of audio prior to any processing. When choosing a window, two main factors must be considered: resolution and spectral leakage [9]. Spectral leakage can be defined as the level of interference, or the size of the tails associated with the main peaks in the spectral analysis [9]. A small window size will decrease the resolution of the frequency response plot, which may cause important detail to be lost. On the other hand, a large window will cause high levels of spectral leakage. Beyond the window size, there are a number of different window functions which can be applied. For this paper a Hamming window [9] was chosen, as it provides a good trade-off between reducing the side lobes and not losing too much of the original peak. An audio file is made up of a large number of samples; for example, the audio used in our experiments was recorded at 11024 samples per second. After a number of tests a Hamming window of 140 samples was selected. This window provided a good balance between resolution and spectral leakage. Figure 1 shows the frequency spectrum for one frame of audio, equivalent to 13 ms.
Figure 1. Frequency response plot for 13 ms of voiced audio
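To make the windowing step concrete, here is a small sketch assuming the parameters quoted above (11024 samples per second and a 140-sample Hamming window); it is illustrative only, not the authors' Matlab code.

```python
# Sketch of the windowed spectral analysis of Section 2.2, using the stated
# 11024 samples/s rate and a 140-sample (~13 ms) Hamming window.
import numpy as np

fs = 11024
win_len = 140
window = np.hamming(win_len)    # Hamming window to limit spectral leakage

def dominant_frequency(frame, fs=fs):
    """Dominant frequency (Hz) of one windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame * window))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

# Example: a 300 Hz tone is reported close to 300 Hz (within one ~79 Hz bin).
t = np.arange(win_len) / fs
print(dominant_frequency(np.sin(2 * np.pi * 300 * t)))
```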
2.3. Overlapping window technique

One of the inherent problems with trying to measure the frequency response of a waveform is that sufficiently large frame sizes must be used to gain any resolution in the frequency domain. However, the larger the frame size, the more difficult it becomes to find changes in the frequency domain over time. For example, the frequency spectrum obtained by analysing 13 ms of audio shows only one major peak, corresponding to the speech at that time. However, if
50 ms of audio is analysed, the frequency spectrum may show two main peaks: one corresponding to the speech which occurred at, say, 20 ms, and a second corresponding to some background noise which occurred at 30 ms. In this situation the algorithm would identify the entire 50 ms of audio as noise, whereas only the last 20 ms is actually noise. To overcome this problem an overlapping technique was used, whereby after every 70 samples of audio a 140-sample window was applied and the signal was moved into the frequency domain. Figure 2 illustrates this approach, whereby the windows overlap one another so that a clearer picture of the changes in frequency over time can be seen. This improves the accuracy of the system.
Figure 2. Windows applied to input audio
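The overlapping analysis can be sketched as follows, again assuming the 140-sample Hamming window and a 70-sample hop described above; this is an illustration rather than the authors' code.

```python
# Sketch of the overlapping-window analysis in Section 2.3: a 140-sample
# Hamming window advanced by 70 samples (50% overlap), tracking the dominant
# frequency of each window.
import numpy as np

fs, win_len, hop = 11024, 140, 70
window = np.hamming(win_len)

def peak_track(audio):
    """Dominant frequency (Hz) of each overlapping window of the signal."""
    peaks = []
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    for start in range(0, len(audio) - win_len + 1, hop):
        frame = audio[start:start + win_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        peaks.append(freqs[np.argmax(spectrum)])
    return np.array(peaks)

# Example with one second of synthetic audio standing in for a recording.
print(peak_track(np.random.randn(fs))[:10])
```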
2.4. Threshold levels

As mentioned in the introduction, the main aim of this speech discrimination algorithm is not to work independently, but to work as part of a larger speech detection system that includes end point detection. Thus, if this frequency derivative algorithm detects a signal to be noise, the rest of the system does not need to proceed with finding the exact end points of the speech. For all the tests we carried out, a single threshold level was used. The threshold corresponds to a variation in the frequency domain of magnitude greater than 65 Hz over a 26 ms period. While it may be possible to gain greater accuracy using a variable threshold system, the tests we conducted showed that the gains from such a system would be minimal, while adding to the processing time.
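A sketch of how such a fixed threshold could be applied to the array of dominant frequencies is given below; the spacing of the compared windows (four hops, roughly 25 ms with a 70-sample hop at 11024 samples per second) is an assumption made for the example, as the paper only gives the 65 Hz over 26 ms figure.

```python
# Sketch of the fixed-threshold decision of Section 2.4: flag audio as noise
# if the dominant frequency changes by more than 65 Hz across roughly 26 ms.
# The step of 4 hops is an illustrative assumption.
import numpy as np

THRESHOLD_HZ = 65.0

def is_noise(peak_freqs, step=4):
    """True if any pair of peaks `step` windows apart differs by > 65 Hz."""
    peaks = np.asarray(peak_freqs, dtype=float)
    if len(peaks) <= step:
        return False
    return bool(np.any(np.abs(peaks[step:] - peaks[:-step]) > THRESHOLD_HZ))

# A large jump between the first and fifth analysis windows marks this as noise.
print(is_noise([210.0, 230.0, 220.0, 215.0, 640.0, 225.0]))   # True
```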
2.5. Problems and further enhancements

The first problem encountered by the speech detection algorithm was caused by the windowing technique presented above. While this technique improved the accuracy of the speech detection system, it also increased the processing time. Using the overlapping window technique, the algorithm could no longer process a wave file in real time; for example, 1 minute of audio required just under 3 minutes of processing. To overcome this problem, modifications were made to the way the algorithm found the changes in frequency. Originally the process was done in three steps: (i) moving into the frequency domain, (ii) finding the changes in frequency over time, and (iii) comparing these changes to a set threshold. To speed up the process these three steps were merged into a single pass. As a result the processing time reduced significantly, taking under 50 seconds for 1 minute of audio. With these changes the system can be used in real time.

As stated earlier, this method of speech discrimination is designed to work as part of a speech recognition system, in conjunction with an accurate method for end point detection. As such, most of the steps required for converting the audio into the frequency domain would be needed anyway for the purpose of end point detection. Thus, it is envisaged that the time taken by the proposed speech discrimination algorithm to find variations in the frequency domain will be a very small part of the total time taken by the complete speech recognition system for speech discrimination, co-channel speaker separation and end point detection.
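As a rough sketch of the merged single pass described above, the windowing, FFT and threshold comparison can be interleaved so that processing stops at the first excessive jump; the parameters repeat the assumptions used in the earlier sketches and the code is illustrative, not the authors' implementation.

```python
# Sketch of the merged single-pass check from Section 2.5: window, FFT and
# threshold comparison interleaved so processing can stop at the first
# excessive frequency jump. Parameter values repeat earlier assumptions.
import numpy as np

FS, WIN, HOP, STEP, THRESHOLD_HZ = 11024, 140, 70, 4, 65.0
WINDOW = np.hamming(WIN)
FREQS = np.fft.rfftfreq(WIN, d=1.0 / FS)

def reject_as_noise(audio):
    """Return True as soon as the dominant frequency jumps by > 65 Hz."""
    recent = []                                  # dominant frequencies so far
    for start in range(0, len(audio) - WIN + 1, HOP):
        frame = audio[start:start + WIN] * WINDOW
        peak = FREQS[np.argmax(np.abs(np.fft.rfft(frame)))]
        if len(recent) >= STEP and abs(peak - recent[-STEP]) > THRESHOLD_HZ:
            return True                          # noise detected, stop early
        recent.append(peak)
    return False

# White noise, used here as a stand-in recording, is typically rejected quickly.
print(reject_as_noise(np.random.randn(FS)))
```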
Figure 3. Frequency derivative plots generated by the proposed speech detection algorithm for (a) noise signals and (b) voice. Top graphs show the input audio signals; bottom graphs show the frequency derivative plots.
The second major problem associated with the algorithm was that of interfering speakers. It took only a few tests to find that if two speakers spoke at the same time, the algorithm would start detecting large variations in frequency. This was caused by the difference in frequency between the two speakers' voices; a male voice, for instance, generally sits much lower in the frequency spectrum than a female voice. The same problem arises if a background noise louder than the speaker's voice interferes with the speech signal, causing a spike in the frequency change. A solution to the problem of co-channel speakers was already available through the use of the ADF, as stated previously. The ADF, along with another algorithm [10] used to detect co-channel speaker interference, could sufficiently remove the interference caused by the other speaker, allowing for accurate analysis by the speech detection algorithm.

The problem of loud background noise proved much more difficult to address; however, the problems it caused were found to be minor. By using a noise-cancelling microphone, background noises were generally kept at a lower level than the original speaker's voice. This meant that the amplitudes of the noise frequencies were overshadowed by the amplitudes of the speaker's frequencies. It was also noted that if audio with loud background interference was fed into the transcription system, the system would fail to transcribe the audio accurately. Therefore, even if the speech detection algorithm had removed that portion of the audio, it would not have made a difference to the end transcription result. This finding led to another useful property of the speech detection algorithm: if large changes in the frequency domain are detected, even for a voice signal, the transcription system will in many cases be incapable of accurate transcription. As such, the algorithm could be adapted in future to notify a user of the transcription system that the recorded audio is likely to be transcribed with poor accuracy. This information could then be used to decide whether to proceed with the transcription, potentially saving the time spent on a poor quality transcription.
3. Experimental results

3.1. Testing procedure

As the main aim of this speech detection algorithm is to distinguish speech from background noise, a number of recordings were made containing only continuous speech, as well as a number of recordings containing common background noises. This was done to measure the rejection rate of the algorithm and to find any circumstance in which normal speech was rejected. The background noises recorded included door slamming, paper rustling, coughing and clapping. The speakers chosen for testing normal speech included both males and females, with native Australian accents as well as one non-native accent (a speaker of Romanian origin speaking English).

Figure 3 shows the frequency derivative plots generated by the speech detection algorithm. The graphs on the left show the results for a wave file which contained only background noises, including coughing, banging, the dropping of a microphone and paper rustling. The graphs on the right show the analysis of a wave file which contained only speech. As can be seen from the lower left-hand graph, the variations in frequency within the wave file containing just noise include many large spikes and generally high fluctuations. The vertical lines (from 0 to 1) in the top left-hand graph indicate where the variations in frequency exceeded the set threshold, indicating that
the audio was not speech. As stated earlier, the threshold used equates to a variation in the frequency domain of magnitude greater than 65 Hz over a 26 ms period. From the frequency derivative of the second wave file (lower right-hand graph) it can be seen that the changes in the frequency domain are much smaller in comparison. Also, there are no vertical lines in the top right-hand graph, indicating that no part of the file had a variation in frequency above the set threshold.
3.2. Results

The results obtained from the tests conducted using the algorithm are summarised in Table 1. From over 800 spoken words only 8 were mistaken for noise, giving an accuracy of close to 100%. The accuracy of noise detection, over 20 different common background noises ranging from paper rustling to a door shutting, was around 85%. Two different microphones were used in the tests: one a high quality Andrea USB microphone [11], and the other an ordinary microphone connected to the computer's sound card. All 8 of the words mistaken for noise were recorded using the ordinary microphone; the Andrea USB microphone produced no errors. The errors in detecting noise did not appear to depend on which microphone was used.

Table 1. Test results

Audio type   No. of words/noises   Accuracy (Andrea USB microphone)   Accuracy (ordinary microphone)
Voice        800                   100%                               99%
Noise        20                    80%                                85%
4. Conclusions

In this paper a new method for the discrimination of speech and noise signals has been presented. It uses measurements made in the frequency domain to find changes in frequency levels over time. While this method is not intended to work alone, the tests conducted show that the algorithm is very effective in detecting what is speech and what is noise. With speech detection accuracy close to 100%, the addition of this algorithm to an existing speech detection system would add to its effectiveness while helping to reduce the computational load. For example, it can be combined with the end point detection system reported in [5] to create a new system capable of accurate end point detection, along with a robust means of determining whether the audio is voice or too noisy for speech transcription.
5. References

[1] S. Graham and J. Sladek, "An Automatic Transcriber of Meetings Utilising Speech Recognition Technology", DSTO Internal Report, Doc. No. DSTO-CR-0355, March 2004.

[2] K. Dogancay, J. Littlefield and A. Hashemi-Sakhtsari, "Performance Evaluation of an Automatic Speech Recogniser Incorporating a Fast Adaptive Speech Separation Algorithm", Proc. of the 2003 Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2003), W-C. Siu (Ed.), Hong Kong, 2003, pp. 205-208.

[3] S. Tsurufuji, H. Ohnishi, M. Lida, R. Suzuki and Y. Sumi, "A Voice Activated Car Audio System", IEEE Transactions on Consumer Electronics, Vol. 37, No. 3, 1991, pp. 592-597.

[4] R. Chengalvarayan, "Robust Energy Normalization using Speech/Non-speech Discriminator for German Connected Digit Recognition", Proceedings of Eurospeech, September 1999, pp. 61-64.

[5] A. Martin, D. Charlet and L. Mauuary, "Robust Speech/Non-Speech Detection using LDA Applied to MFCC", France Telecom R&D, France, 2001.

[6] J. Kacur, J. Frank and G. Rozinaj, "Speech Detection in the Noisy Environment Using Wavelet Transform", Proceedings of EURASIP Conference, Zagreb, Croatia, 2003.

[7] B. St. George, E. Wooten and L. Sellami, "Speech Coding and Phoneme Classification Using MATLAB and NeuralWorks", Department of Electrical Engineering, U.S. Naval Academy, Annapolis, USA, 1998.

[8] B. Wu and K. Wang, "Robust Endpoint Detection Algorithm Based on the Adaptive Band-Partitioning Spectral Entropy in Adverse Environments", IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 5, 2005, pp. 762-775.

[9] S. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, California Technical Publishing, USA, 1997.

[10] K. Yen and Y. Zhao, "Co-Channel Speech Separation for Robust Automatic Speech Recognition: Stability and Efficiency", Beckman Institute and Department of Electrical and Computer Engineering, University of Illinois, USA, 1997.

[11] Andrea Audio Test Labs White Paper, "Andrea Superbeam® Array Microphone Speech Recognition Performance Using Microsoft Office XP", New York, USA, 2002, viewed on 21 September 2006.