Springer Handbook on Speech Processing and Speech Communication


LOW-BIT-RATE SPEECH CODING

Alan McCree
Information Systems Technology Group, MIT Lincoln Laboratory
244 Wood Street, Lexington, MA 02420, USA

ABSTRACT

Low-bit-rate speech coding, at rates below 4 kb/s, is needed for both communication and voice storage applications. At such low rates, full encoding of the speech waveform is not possible; therefore, low-rate coders rely instead on parametric models to represent only the most perceptually-relevant aspects of speech. While there are a number of different approaches for this modeling, all can be related to the basic linear model of speech production, where an excitation signal drives a vocal tract filter. The basic properties of the speech signal and of human speech perception can explain the principles of parametric speech coding as applied in early vocoders. Current speech modeling approaches, such as mixed excitation linear prediction, sinusoidal coding, and waveform interpolation, use more sophisticated versions of these same concepts. Modern techniques for encoding the model parameters, in particular using the theory of vector quantization, allow the encoding of the model information with very few bits per speech frame. Successful standardization of low-rate coders has enabled their widespread use for both military and satellite communications, at rates from 4 kb/s all the way down to 600 b/s. However, the goal of toll-quality low-rate coding continues to provide a research challenge.

This work was sponsored by the Defense Advanced Research Projects Agency under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

1. INTRODUCTION As digital computers and communication systems continue to spread through our modern society, the use of digitized speech signals is increasingly common. The large number of bits required for accurate reproduction of the speech waveform makes many of these systems complex and expensive, so more efficient encoding of speech signals is desirable. For example, limited radio bandwidth is a major constraint in design of the next generation of public mobile telephone systems, and the speech data rate directly influences the bandwidth requirement. In military tactical communications, a system with a lower speech data rate can use less transmitter power to make detection more difficult, or it can allow higher signal to noise ratios to improve performance in a hostile jamming environment. Also, computer storage of speech, such as in voice mail or voice response systems, becomes cheaper if the number of bits required for speech storage can be reduced. These are just some of the applications which can benefit from the development of algorithms to significantly reduce the speech data rate. At bit rates above 4 kb/s, speech-specific waveform coders based on code excited linear prediction (CELP) [1] can produce good quality speech. But at lower rates, it becomes very difficult to encode all of the information in the speech signal. Therefore, most low rate coders model only the key perceptual features of speech, rather than the entire waveform. Since this is typically done by encoding the parameters of a linear model for speech, these are called parametric speech coders. In Section 2, this presentation of low-rate speech coding begins with a review of the basics of human


speech production and perception, and then introduces the linear model of speech as the basis for parametric speech coding. Section 3 then presents modern low-rate speech coding models using mixed excitation linear prediction, sinusoidal coding, and waveform interpolation. After a discussion of techniques for quantizing the model parameters in Section 4, complete low-rate coder designs that have been standardized for use in a range of applications are presented in Section 5, followed by conclusions and a summary in Section 6.

2. FUNDAMENTALS: PARAMETRIC MODELING OF SPEECH SIGNALS

For parametric coding, it is important to understand the properties of speech signals and which of their characteristics need to be preserved to provide high quality to a human listener. Therefore, in this section, we review human speech production and perception and then introduce the classical vocoder algorithms.

2.1. Speech Production

Speech is a sequence of sounds generated by the human vocal system [2, 3]. The acoustic energy necessary to produce speech is generated by exhaling air from the lungs. This air stream is used to produce sound in two different ways: by vibrating the vocal cords or by forcing air turbulence. If the vocal cords are used, the speech is referred to as voiced speech; otherwise, the speech is called unvoiced. In voiced speech, the opening and closing of the vocal cords at the glottis produces quasi-periodic puffs of air called glottal pulses, which excite the acoustic tubes of the vocal and nasal tracts. The average spacing between glottal pulses is called the pitch period. The frequency content of the resulting acoustic wave propagating from the mouth depends on the spectrum of the glottal pulses and on the configuration of the vocal tract. Typically, glottal pulses are roughly triangular in shape with a sharp discontinuity at the closure of the glottis, as shown in Figure 1. As this stylized example illustrates, glottal pulses have most of their energy concentrated at lower frequencies. Unvoiced sounds result from turbulent noise exciting the vocal and nasal tracts. The turbulence can come from simply breathing air out of the lungs in


a process called aspiration. This is the excitation source for whispered speech and for the /h/ sound. Turbulence can also be generated by forcing the air through a constriction in the vocal tract formed with the tongue or lips. This process, called frication, is used to make sounds such as /sh/. Figures 2–4 show time waveforms and Fourier spectra for some typical speech sounds. Figure 2 shows the waveform and spectrum from a sustained vowel. Since this vowel was generated by periodic glottal pulses, the spectrum consists of harmonics of the pitch fundamental, in this case 110 Hz. The spectral peaks at 600, 1500, 2500, and 3600 Hz represent the formant frequencies in this example. The relatively high amplitudes of the first few harmonics are a result of the glottal pulse excitation. Figure 3 shows the waveform and spectrum of a sustained fricative. This sound is mainly high-frequency turbulent noise. An example of a plosive sound is shown in Figure 4. Since this is an unvoiced plosive, the signal consists of silence during the vocal tract closure followed by a broad-band noise burst. The basic features of the speech production process can be captured in a simple time-varying linear model, as shown in Figure 5. In this model, glottal pulses or random noise pass through a linear filter representing the vocal tract frequency response. The parameters describing the excitation and vocal tract change as a function of time, tracking changes in the speech waveform. Key parameters include the voiced/unvoiced decision, pitch period, glottal pulse shape, vocal tract filter coefficients, and power level. This model can directly mimic stationary sounds, whether voiced or unvoiced, by appropriate choice of excitation signal and vocal tract frequency response. Also, it can approximate non-stationary sounds if the model parameters are changed as rapidly as in natural speech.
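This source-filter view is easy to simulate. The following is a minimal Python sketch of the linear model of Figure 5: a pulse-train or noise excitation drives an all-pole vocal-tract filter. The sample rate, pitch period, formant frequencies, pole radius, and gain below are illustrative assumptions, not values taken from this chapter.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000            # sample rate in Hz (assumed)
pitch_period = 80    # 80 samples = 100 Hz pitch at 8 kHz (assumed)
n_samples = 1600     # 200 ms of signal

def excitation(voiced: bool) -> np.ndarray:
    """Excitation source: impulse train for voiced speech, white noise otherwise."""
    if voiced:
        e = np.zeros(n_samples)
        e[::pitch_period] = 1.0           # one "glottal pulse" per pitch period
        return e
    return np.random.randn(n_samples)     # aspiration / frication noise

def vocal_tract(formants_hz=(600, 1500, 2500), radius=0.95) -> np.ndarray:
    """All-pole vocal-tract filter: conjugate pole pairs near assumed formants."""
    poles = []
    for f in formants_hz:
        w = 2.0 * np.pi * f / fs
        poles += [radius * np.exp(1j * w), radius * np.exp(-1j * w)]
    return np.poly(poles).real             # denominator A(z) of the filter 1/A(z)

gain = 0.1
speech = gain * lfilter([1.0], vocal_tract(), excitation(voiced=True))
```

Changing the voiced/unvoiced flag, pitch period, filter coefficients, and gain on every frame gives the time-varying behaviour described above.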

2.2. Human Speech Perception Human perception of speech is determined by the capabilities of the human auditory system, which consists of the ear, the auditory nerve, and the brain [4]. The ear serves as a transducer, converting the acoustic input at the outer ear first to bone vibrations in the middle ear, then to fluid motion in the cochlea of the inner ear, and finally to electrical pulses gener-


ated by the inner hair cells in the cochlea. The location of maximum fluid vibration in the cochlea varies systematically with the input signal frequency, and this frequency response varies with the strength of the input signal. Also, the inner hair cells only detect motion in one direction. Thus, the ear acts like a very large bank of bandpass filters with dynamic range compression, and the output of each filter undergoes half-wave rectification. The half-wave rectified bandpass filter outputs are transmitted across the auditory nerve to the lower levels of the brain, where specialized neurons can perform basic signal processing operations. For example, there are neurons that respond to onsets and to decays, and other neurons may be able to estimate autocorrelations by comparing a signal to a delayed version of itself. The outputs of these neurons are then passed to higher levels of the brain for more sophisticated processing. This results in the final analysis of the acoustic signal based on context and additional knowledge, such as classification of sound source and interpretation of pattern. In speech processing, this includes recognition of words as well as analysis of speaker identity and emotional state. For a typical speech signal such as the vowel shown in Figure 2, a human listener should be able to distinguish a number of features. A comparison of the average power in each bandpass filter gives an overall estimate of the power spectrum of the speech signal. This spectral analysis will also yield the exact frequency of the lower pitch harmonics. However, human listeners cannot distinguish individual frequencies present in a twelve tone signal unless the tones are separated by more than a measure of bandwidth called the critical band [5], and these bandwidths increase with frequency as shown in Figure 6. Therefore, for a typical male pitch frequency of 100 Hz, individual pitch harmonics will not be distinguished above about 500 Hz. Higher frequency harmonics near the formant frequencies may also be detected by this spectral analysis if they are much stronger than their immediate neighbors. At higher frequencies, individual pitch harmonics cannot be resolved since each bandpass filter contains multiple pitch harmonics. However, there is a great deal of information in the time variation of the higher frequency bandpass filter outputs. An example of a bandpass filter output centered at 2500 Hz from a


sustained vowel is shown in Figure 7. Notice that the individual pitch pulses can be seen in the rise and fall of this bandpass signal. It is reasonable to presume that the brain can detect pitch pulses as abrupt power transitions in the bandpass signals, and may also estimate pitch frequency from the average pulse spacing and voicing strength from the abruptness of the pulse onset. In fact, an experiment has shown that higher frequency onset neurons in the cat respond with every pitch pulse in a sustained synthetic speech vowel [6]. Overall, the combination of spectral and bandpass waveform analysis allows perception of many aspects of speech signals. For sustained or continuant sounds, the human listener should be able to estimate the spectral envelope, gain, harmonic content at lower frequencies, and pulse characteristics at higher frequencies. For noncontinuants, the human auditory system can detect rapid transitions in these characteristics, especially at high frequencies. These are the characteristics of the speech signal that we would expect low-rate coders to be able to reproduce.
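To make the critical-band discussion concrete, the sketch below uses one commonly cited closed-form approximation to critical bandwidth (a Zwicker-style fit); the constants come from that published approximation, not from this chapter, and only illustrate how the bandwidth grows with frequency.

```python
import numpy as np

def critical_bandwidth_hz(f_hz):
    """Approximate critical bandwidth (Hz) at centre frequency f_hz (Zwicker-style fit)."""
    f_khz = np.asarray(f_hz, dtype=float) / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

for f in (100, 500, 1000, 2500, 4000):
    print(f"{f:5d} Hz  ->  ~{critical_bandwidth_hz(f):4.0f} Hz wide")
```

For a 100 Hz male pitch, the approximate bandwidth already exceeds the 100 Hz harmonic spacing near 500 Hz, consistent with the statement that individual harmonics are resolved only at low frequencies.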

2.3. Vocoders Knowledge of speech production and speech perception can be combined to give insight into efficient transmission and storage of speech signals. The ensemble of possible speech waveforms is fundamentally limited by the capabilities of the speech production process, and not all details of these waveforms are perceptually important. One particularly efficient method of speech compression is to analyze the parameters of the linear speech model, transmit these parameters across the communication channel, and synthesize a reproduction of the speech signal with a linear model. Speech coders based on this approach are called vocoders or parametric coders. If the parameters are updated from 20 to 100 times per second, the synthesizer can reproduce much of the information of the speech signal using significantly fewer bits than direct waveform coding. The first vocoder was developed by Homer Dudley in the 1930’s [7]. In this method, the input speech is divided into ten frequency channels, and the spectral envelope information representing the vocal tract is transmitted as the average power in each channel. The synthesizer applies either a pulse train or random noise to an identical bank of bandpass filters. This


approach, now called a channel vocoder, can produce speech which is intelligible but not of very high quality. Because it uses only a fairly small number of frequency values to represent the spectral envelope, the channel vocoder cannot accurately reproduce the perceptually important formant structure of voiced speech. A better approach, called the formant vocoder [8, 9], transmits information about the formant frequencies directly and synthesizes speech with a small number of damped resonators or poles. The poles can be connected either in parallel or in cascade. Since they better model the spectral envelope of voiced speech, formant vocoders can produce speech of higher quality than channel vocoders. Unfortunately, the formants are difficult to reliably estimate in the analysis procedure, so formant vocoders may make annoying errors in tracking the formant frequencies. Formant synthesizers are more often used in text-to-speech applications where the formant frequencies can be generated by fixed rules. The difficulty of formant tracking can be avoided if more poles are used to model the entire speech spectrum, without explicit identification of formant frequencies. In this way, the formants will be automatically captured when they dominate the spectrum, but more subtle spectral features due to glottal pulse shape or nasal zeros can still be approximately reproduced. This principle is used in vocoders based on linear predictive coding (LPC), first introduced around 1970 [10, 11]. The term linear prediction is used because the all-pole coefficients are derived from the best predictor of the current input speech sample from a linear combination of previous samples:

\[
\hat{s}(n) = \sum_{m=1}^{P} a_m \, s(n-m), \qquad (1)
\]

where a_m are the set of P predictor coefficients. Minimum mean square error estimation results in a highly efficient time-domain algorithm to generate these filter coefficients [3]. A block diagram of the LPC synthesizer is shown in Figure 8. Either a periodic impulse train or white noise is used to excite an all-pole filter, and the output is scaled to match the level of the input speech. The voicing decision, pitch period, filter coefficients, and gain are updated for every block of input speech (called a speech frame) to track changes in the input speech.

Since the basic LPC vocoder does not produce high quality speech, there has been significant effort aimed at improving the standard model. One well-known problem with vocoder speech output is a strong buzzy quality. Formant synthesizers use a mixed excitation with simultaneous pulse and noise components to address this problem [8, 9], an idea which has also been used in channel vocoders [12] and LPC vocoders [13, 14, 15]. Most of these early mixed excitation systems used a lowpass-filtered pulse combined with highpass-filtered noise, but a multiband mixture algorithm with separate voicing decisions in each of three bands was used in [12]. Other attempts at vocoder improvement included using more realistic excitation pulses [16] and pitch-synchronous LPC analysis [17]. Atal and David proposed encoding the Fourier coefficients of the LPC excitation signal [18]. Mathematically, this can be written as:

\[
x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) \exp(j\omega_k n), \qquad (2)
\]

where x(n) is the excitation signal, N is the pitch period, k is a frequency index, X(k) is a complex amplitude, and ω_k = 2πk/N is the frequency of the k-th harmonic. Since x(n) is real, this can equivalently be written as:

\[
x(n) = \sum_{k=0}^{N/2} \left[ A_c(k) \cos(\omega_k n) + A_s(k) \sin(\omega_k n) \right], \qquad (3)
\]

where A_c(k) and A_s(k) are real amplitudes, or as:

\[
x(n) = \sum_{k=0}^{N/2} A(k) \cos(\omega_k n + \phi_k), \qquad (4)
\]

where A(k) and φ_k represent the magnitudes and phases of the harmonics. These coefficients were analyzed with either a pitch-synchronous Fourier series analysis or a frame-based Fourier transform, and the excitation was synthesized as a sum of harmonic sine waves. While these techniques provided some improvement, as of the early 1980's these enhanced LPC vocoders were still unable to provide reliable communications in all environments. However, they did


provide the background for newer approaches that have provided significant improvements in speech quality at low rates. The more recent technical progress enabling this success has come in two areas: flexible parametric models and efficient quantization techniques. These are the subjects of the next two sections. 3. FLEXIBLE PARAMETRIC MODELS Three more recent speech models have been shown to provide significant performance improvements over the traditional LPC vocoder model: mixed excitation linear prediction, sinusoidal coding, and waveform interpolation. In this section, we review each of these approaches.

3.1. Mixed Excitation Linear Prediction (MELP)

The primary feature of the MELP model, shown in Figure 9, is the use of mixed excitation for more robust LPC synthesis [19]. Other key features are the use of aperiodic pulses, adaptive spectral enhancement to match formant waveforms, and a pulse dispersion filter to better match natural excitation waveform characteristics. Finally, MELP also includes representation of the Fourier magnitudes of the excitation signal.

3.1.1. Mixed Excitation

MELP generates an excitation signal with different mixtures of pulse and noise in each of a number (typically five) of frequency bands [20]. As shown in Figure 9, the pulse train and noise sequence are each passed through time-varying spectral shaping filters and then added together to give a full-band excitation. For each frame, the frequency shaping filter coefficients are generated by a weighted sum of fixed bandpass filters. The pulse filter is calculated as the sum of each of the bandpass filters weighted by the voicing strength in that band. The noise filter is generated by a similar weighted sum, with weights set to keep the total pulse and noise power constant in each frequency band. These two frequency shaping filters combine to give a spectrally flat excitation signal with a staircase approximation to any desired noise spectrum. Since only two filters are required regardless of the number of frequency bands, this structure is more computationally efficient than using a bank of bandpass filters [19].

To make full use of this mixed excitation synthesizer, the desired mixture spectral shaping must be accurately estimated for each frame. In MELP, the relative pulse and noise power in each frequency band is determined by an estimate of the voicing strength at that frequency in the input speech. An algorithm to estimate these voicing strengths combines two methods of analysis of the bandpass filtered input speech. First, the periodicity in each band is estimated using the strength of normalized correlation coefficients around the pitch lag, where the correlation coefficient at delay t is defined by

\[
c(t) = \frac{\sum_{n=0}^{N-1} s(n)\, s(n+t)}{\sqrt{\sum_{n=0}^{N-1} s^2(n) \; \sum_{n=0}^{N-1} s^2(n+t)}}. \qquad (5)
\]
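As a rough illustration of this bandpass voicing-strength estimate, the sketch below computes the normalized correlation of (5) and takes the larger of the waveform and envelope correlations for one band. The frame length and the assumption that the band-filtered signal and its rectified, smoothed envelope are already available are mine, not the coder's exact framing.

```python
import numpy as np

def normalized_correlation(s, t, N):
    """Normalized correlation c(t) of eq. (5) over an N-sample window."""
    s = np.asarray(s, dtype=float)
    x, y = s[:N], s[t:t + N]
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    return float(np.dot(x, y) / denom) if denom > 0.0 else 0.0

def band_voicing_strength(band_signal, band_envelope, pitch_lag, N=180):
    # Voicing strength for one frequency band: the larger of the correlation of
    # the bandpass waveform and the correlation of its smoothed envelope.
    return max(normalized_correlation(band_signal, pitch_lag, N),
               normalized_correlation(band_envelope, pitch_lag, N))
```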

This technique works well for stationary speech, but the correlation values can be too low in regions of varying pitch. The problem is worst at higher frequencies, and results in a slightly whispered quality to the synthetic speech. The second method is similar to time domain analysis of a wideband spectrogram. The envelopes of the bandpass filtered speech are generated by full wave rectification and smoothing with a one-pole lowpass filter, while using a firstorder notch filter to remove the DC term from the output. At higher frequencies, these envelopes can be seen to rise and fall with each pitch pulse, just as in the spectrogram display. Autocorrelation analysis of these bandpass filter envelopes yields an estimate of the amount of pitch periodicity. The overall voicing strength in each frequency band is chosen as the largest of the correlation of the bandpass filtered input speech and the correlation of the envelope of the bandpass filtered speech. 3.1.2. Aperiodic Pulses This mixed excitation can remove the buzzy quality from the speech output, but another distortion is sometimes apparent. This is the presence of short isolated tones in the synthesized speech, especially for female speakers. The tones can be eliminated by adding noise in the lower frequencies, but so much


noise is required that the output speech sounds rough and noisy. A more effective solution is to destroy the periodicity in the voiced excitation by varying each pitch period length with a pulse position jitter uniformly distributed up to ±25%. This allows the synthesizer to mimic the erratic glottal pulses that are often encountered in voicing transitions or in vocal fry [21]. This cannot be done for strongly voiced frames without introducing a hoarse quality, however, so a control algorithm is needed to determine when the jitter should be added. Therefore, MELP adds a third voicing state to the voicing decision which is made at the transmitter [22]. The input speech is now classified as either voiced, jittery voiced, or unvoiced. In both voiced states, the synthesizer uses a mixed pulse/noise excitation, but in the jittery voiced state the synthesizer uses aperiodic pulses as shown in Figure 9. This makes the problem of voicing detection easier, since strong voicing is defined by periodicity and is easily detected from the strength of the normalized correlation coefficient of the pitch search algorithm. Jittery voicing corresponds to erratic glottal pulses, so it can be detected by either marginal correlation or peakiness in the input speech. Peakiness p is defined by the ratio of the RMS power to the average value of the full-wave rectified LPC residual signal [23]:

\[
p = \frac{\sqrt{\frac{1}{N} \sum_{n=0}^{N-1} s^2(n)}}{\frac{1}{N} \sum_{n=0}^{N-1} |s(n)|}. \qquad (6)
\]

This peakiness detector will detect unvoiced plosives as well as jittery voicing, but this is not a problem since the use of randomly spaced pulses has previously been suggested to improve the synthesis for plosives [15].

3.1.3. Adaptive Spectral Enhancement

The third feature in the MELP model is adaptive spectral enhancement. This adaptive filter helps the bandpass filtered synthetic speech to match natural speech waveforms in the formant regions. Typical formant resonances usually do not completely decay in the time between pitch pulses in either natural or synthetic speech, but the synthetic speech waveforms reach a lower valley between the peaks. This is probably caused by the inability of the poles in the LPC


synthesis filter to reproduce the features of formant resonances in natural human speech. Here, there are two possible causes. First, the problem could be simply due to improper LPC pole bandwidth. The synthetic time signal may decay too quickly because the LPC pole has a weaker resonance than the true formant. Another possible explanation is that the true formant bandwidth may vary somewhat within the pitch period [8], and the synthetic speech cannot mimic this behavior. Informal experiments have shown that a time-varying LPC synthesis pole bandwidth can improve speech quality by modeling this effect, but it can be difficult to control [19]. The adaptive spectral enhancement filter provides a simpler solution to the problem of matching formant waveforms. This adaptive pole/zero filter was originally developed to reduce quantization noise between the formant frequencies in CELP coders [24]. The poles are generated by a bandwidth-expanded version of the LPC synthesis filter, where each z −1 term in the z-transform of the LPC filter is replaced by αz −1 , with α equal to 0.8. Since this all-pole filter introduces a disturbing lowpass filtering effect by increasing the spectral tilt, a weaker all-zero filter calculated with α equal to 0.5 is used to decrease the tilt of the overall filter without reducing the formant enhancement. In addition, a simple first-order FIR filter is used to further reduce the lowpass muffling effect [25]. In the MELP coder, reducing quantization noise is not a concern, but the time-domain properties of this filter produce an effect similar to pitchsynchronous pole bandwidth modulation. As shown in Figure 10, a simple decaying resonance has a less abrupt time domain attack when this enhancement filter is applied. This feature allows the speech output to better match the bandpass waveform properties of natural speech in formant regions, and it increases the perceived quality of the synthetic speech.
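A minimal sketch of such a pole/zero enhancement filter is shown below, assuming the LPC coefficients are available in the usual A(z) = [1, -a1, ..., -aP] form; the tilt-compensation constant mu is an assumed value, since the text only states that a simple first-order FIR stage is used.

```python
import numpy as np
from scipy.signal import lfilter

def adaptive_spectral_enhancement(x, lpc_a, alpha=0.8, beta=0.5, mu=0.3):
    """Apply H(z) = A(z/beta) / A(z/alpha), then a first-order FIR stage
    (1 - mu*z^-1).  alpha and beta follow the text; mu is an assumption."""
    k = np.arange(len(lpc_a))
    zeros = lpc_a * beta ** k       # weaker all-zero part, limits added spectral tilt
    poles = lpc_a * alpha ** k      # bandwidth-expanded LPC synthesis poles
    y = lfilter(zeros, poles, x)
    return lfilter([1.0, -mu], [1.0], y)   # mild high-frequency boost against muffling
```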

3.1.4. Pulse Dispersion Filter The pulse dispersion filter shown in Figure 9 improves the match of bandpass filtered synthetic and natural speech waveforms in frequency bands that do not contain a formant resonance. At these frequencies, the synthetic speech often decays to a very small value between the pitch pulses. This is also


true for frequencies near the higher formants, since these resonances decay significantly between excitation points, especially for the longer pitch periods of male speakers. In these cases, the bandpass filtered natural speech has a smaller peak-to-valley ratio than the synthetic speech. In natural speech, the excitation may not all be concentrated at the point in time corresponding to closure of the glottis [26]. This additional excitation prevents the natural bandpass envelope from falling as low as the synthetic version. This could be due to a secondary excitation peak from the opening of the glottis, aspiration noise resulting from incomplete glottal closure, or a small amount of acoustic background noise which is visible in between the excitation peaks. In all of these cases, there is a greater difference between the peak and valley levels of the bandpass filtered waveform envelopes for the LPC speech than for natural human speech. The pulse dispersion filter is a fixed FIR filter, based on a spectrally flattened synthetic glottal pulse, which introduces time-domain spread to the synthetic speech. In [19], a fixed triangle pulse [16, 27] based on a typical male pitch period is used, with the lowpass character removed from its frequency response. The filter coefficients are generated by taking a discrete Fourier transform (DFT) of the triangle pulse, setting the magnitudes to unity, and taking the inverse DFT. The dispersion filter is applied to the entire synthetic speech signal to avoid introducing delay to the excitation signal prior to synthesis. Figure 11 shows some properties of this triangle pulse and the resulting dispersion filter. The pulse has considerable time-domain spread, as well as some fine detail in its Fourier magnitude spectrum. Using this dispersion filter decreases the synthetic bandpass waveform peakiness in frequencies away from the formants and results in more natural sounding LPC speech output.
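The dispersion filter design described above (triangle pulse, DFT, magnitude flattening, inverse DFT) can be sketched as follows; the pulse length and shape are illustrative assumptions for a typical male pitch period.

```python
import numpy as np

def pulse_dispersion_filter(length=65, peak_fraction=0.25):
    """Flatten the DFT magnitude of a triangle pulse to obtain FIR coefficients
    with time-domain spread but a roughly all-pass magnitude response."""
    n = np.arange(length, dtype=float)
    peak = int(length * peak_fraction)
    rise = n[:peak + 1] / peak                                  # linear attack
    fall = (length - 1 - n[peak + 1:]) / (length - 1 - peak)    # linear decay
    triangle = np.concatenate([rise, fall])
    spectrum = np.fft.fft(triangle)
    flattened = np.exp(1j * np.angle(spectrum))                 # unit magnitude, keep phase
    return np.real(np.fft.ifft(flattened))

h = pulse_dispersion_filter()
# The filter is applied to the entire synthetic speech signal,
# e.g. scipy.signal.lfilter(h, [1.0], speech).
```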

3.1.5. Fourier Series Modeling The final feature of the MELP model is a Fourier series expansion of the voiced excitation signal. Implementing a Fourier series expansion in MELP synthesis is straightforward, using the pitch-synchronous Fourier series synthesis of (4). Instead of using a simple time domain digital impulse, each pitch pulse of the mixed excitation signal shown in Figure 9 is generated by an inverse DFT of exactly one period in


length. If the magnitude values of this DFT are all equal to one and the phases are all set to zero, this is simply an alternate representation for a digital impulse. As in [18], these magnitudes can be estimated using either a pitch-synchronous Fourier series analysis or a longer window DFT.

3.1.6. MELP Model Improvements

A number of refinements to the basic MELP model have been demonstrated. Replacement of the frame-based autocorrelation pitch algorithm with a subframe-based approach that allows time-varying pitch within an analysis frame was shown to improve the reliability of pitch estimation in [28]. The use of a sliding pitch window, a specially-designed plosive speech analysis and synthesis algorithm, and an excitation magnitude postfilter combined to improve performance in [29]. A wideband MELP coder, extended to 8 kHz bandwidth, provided improved quality in [30]. An intelligibility enhancement preprocessor that emphasizes the perceptually important formant transitions by combining adaptive spectral enhancement with variable-rate time-scale modification improved intelligibility in [31]. Finally, a number of enhancements, including subframe pitch estimation, pitch-synchronous LPC and Fourier series analysis, and oversampled Fourier synthesis for fractional pitch values, showed significant performance improvement for female speakers [32].

3.2. Sinusoidal Coding

Another successful approach to low rate speech coding is based on modeling speech as a sum of sine waves [33, 34]. Initial encouraging work in this direction [35, 36] led to the development of two successful speech coding techniques: the sinusoidal transform coder (STC) [37] and the multiband excited (MBE) vocoder [38].

3.2.1. The Sinusoidal Model of Speech

Many signals can be locally modeled as a sum of sine waves:

\[
s(n) = \sum_{l=0}^{L} A(l) \cos(\omega_l n + \phi_l), \qquad (7)
\]


where A(l), ω_l, and φ_l are the magnitude, frequency, and phase of the l-th sine wave. Note that, while this equation is similar to (4), in this case we are not constraining the frequencies to be harmonically related, and we are analyzing the speech signal itself rather than the excitation signal. While it is clear that such a model is appropriate for quasiperiodic voiced speech, it can be shown that this representation can also provide a perceptually sufficient representation of unvoiced speech as long as the sinusoids are closely spaced in frequency [37]. The principle of sinusoidal speech coding is to estimate and transmit the sine wave parameters for each speech frame, and then to synthesize speech using these parameters.

3.2.2. Sinusoidal Parameter Estimation

The general problem of estimating the time-varying sine wave parameters such that the reconstructed speech signal is as close as possible to the original speech is difficult to solve analytically. However, if a speech signal is locally periodic, then minimizing the squared estimation error over a window spanning a number of consecutive pitch periods produces a straightforward solution: the optimal estimates of the complex amplitudes are given by the DFT values at the harmonics of the pitch fundamental frequency [33]. Also, since the non-harmonic DFT coefficients will be zero in this case, the sine wave frequencies correspond to the peaks of this spectral estimate. Based on these insights, a more general sinusoidal analysis method has been developed and shown to be very effective in STC [37]:

• window the input speech signal, for example with a Hamming window of duration approximately 2–3 pitch periods
• compute the DFT
• find the sine wave frequencies as the locations of the peaks of the DFT magnitude
• estimate the magnitude and phase of each sinusoid based on the corresponding complex DFT coefficients.

An alternative approach, used in the MBE vocoder, is to assume the speech signal is locally periodic, but to explicitly model the DFT distortion due


to windowing as a frequency-domain convolution of the desired sine wave coefficients with the DFT of the window function [38]. Then the sinusoid frequencies are the harmonics of the pitch fundamental, and the minimum squared error estimates of the amplitudes and phases come from the speech spectrum at frequencies around the harmonic along with the spectrum of the window function. Finally, the sinewave parameters can be estimated using analysis-by-synthesis, either in the frequency or time domain [39, 40, 41]. 3.2.3. Synthesis At first glance, it would seem that sinusoidal synthesis using (7) is straightforward. However, each frame of speech has different sine wave parameters, so the synthesizer must perform smoothing to avoid discontinuities at frame boundaries. One form of parameter interpolation involves matching sinusoids from one frame to the next and then continuously interpolating their amplitudes, phases, and frequencies [37]. For well-behaved signals this matching process is not difficult, but a general solution typically involves an additional algorithm for birth and death of sine wave tracks to handle difficult cases. Phase interpolation also presents difficulties, since the phase values are only relative within a period, and the number of periods from one frame to the next must also be estimated. An elegant solution is to assume the phase trajectory can be modeled with a cubic polynomial as a function of time; this is the simplest solution meeting endpoint constraints for both frequency and phase. A less-complex synthesis technique is the overlap/add method. Each frame is synthesized once with the constant parameters from the previous frame, and again with the new parameter set. The two synthesized signals are appropriately weighted by windowing functions and then summed together to produce the final synthesis output. These windows are designed such that the previous parameter synthesis tapers down to zero, the new parameter synthesis tapers up from zero, and the sum of the windows is one for every sample. Despite performing synthesis more than once for at least some of the speech samples, overlap/add synthesis can reduce complexity since there is no need to compute interpolated val-


ues of the parameters for every sample. In addition, it is readily amenable to the use of efficient fast Fourier transform methods.
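A minimal sketch of overlap/add sinusoidal synthesis is shown below; the triangular cross-fade windows and the made-up harmonic parameters are assumptions chosen only so that the example runs.

```python
import numpy as np

def synth(n, freqs, amps, phases):
    """Sum-of-sinusoids synthesis, as in eq. (7), evaluated at sample indices n."""
    return sum(a * np.cos(w * n + p) for w, a, p in zip(freqs, amps, phases))

def overlap_add_frame(prev_params, new_params, frame_len):
    """Synthesize the frame twice (previous and current parameter sets) and
    cross-fade with complementary windows that sum to one at every sample."""
    n = np.arange(frame_len)
    fade_in = n / (frame_len - 1)
    return (1.0 - fade_in) * synth(n, *prev_params) + fade_in * synth(n, *new_params)

w0 = 2 * np.pi * 100 / 8000                          # 100 Hz fundamental at 8 kHz (assumed)
prev = ([w0, 2 * w0], [1.0, 0.5], [0.0, 0.0])        # (frequencies, amplitudes, phases)
curr = ([1.05 * w0, 2.1 * w0], [0.9, 0.6], [0.3, 0.1])
frame = overlap_add_frame(prev, curr, 160)
```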

3.2.4. Low Rate Coding Constraints The general sinusoidal model has been used successfully in many applications. Unfortunately, for low rate speech coding, the complete set of sine wave amplitudes, phases, and frequencies typically contains too much information for efficient quantization. Therefore, the linear speech model is used to impose constraints on the sine wave parameters. First, the voiced speech frequencies are assumed to be harmonics of the fundamental frequency. This eliminates the need to estimate and transmit the exact frequency of each sinusoid, and incorporating this constraint into the sine wave analysis procedure naturally leads to integrated frequency-domain pitch and voicing estimation procedures. In both STC and MBE, pitch estimation is performed by measuring the optimal harmonic sinusoidal fit over a range of possible fundamental frequencies, and then selecting the pitch value as the one providing the best performance by a distance metric. The linear speech model also necessitates the explicit modeling of unvoiced speech. Voicing decisions can be made based on the fit of the harmonic sinusoidal model to the original speech, since voiced speech frames should produce a better model fit. Typically better performance is obtained with a “soft” voicing decision, where partial voicing is allowed with both voiced and unvoiced components. As in earlier mixed excitation work, this is done either using a cutoff frequency, where frequencies below the cutoff are voiced and those above are unvoiced [37], or with separate decisions for a number of frequency bands [38]. The unvoiced component can be synthesized either with sinusoidal synthesis with randomized phases, or with white noise generation. Since encoding of the phases can require a significant number of bits, low rate sinusoidal coders use a parametric model for the phase as well. Since the voiced excitation signal should look like a pulse, it can be assumed to be zero phase with a flat magnitude. In this case, the overall magnitude and phase responses will come from the vocal tract filter.
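The following is only a schematic illustration of frequency-domain pitch selection by harmonic fit; the actual STC and MBE analyzers fit a full harmonic model (including the window spectrum) and handle sub-harmonic ambiguities, none of which is attempted here.

```python
import numpy as np

def harmonic_pitch(frame, fs, f0_min=60.0, f0_max=400.0, nfft=1024):
    """Pick the candidate fundamental whose harmonic bins capture the most
    spectral energy.  A practical coder would normalize this score and
    penalize sub-multiples of the true pitch."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), nfft))
    best_f0, best_score = f0_min, -np.inf
    for f0 in np.arange(f0_min, f0_max, 1.0):
        harmonics = np.arange(f0, fs / 2.0, f0)
        bins = np.round(harmonics * nfft / fs).astype(int)
        score = float(np.sum(spec[bins] ** 2))
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0
```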


In addition, the vocal tract can be modeled parametrically. In particular, LPC coefficients can be fit to the sine wave amplitudes. This can be done with a frequency domain algorithm, which can also incorporate a mel-scale frequency warping to improve perceptual modeling [42, 43], or with a time-domain LPC analysis [44]. The resulting model will have a minimum phase output corresponding to this all-pole filter. Alternatively, the sinusoidal amplitudes can be quantized using non-parametric techniques such as adaptive transform coding [38] or direct vector quantization [45, 46, 47].

3.2.5. Spectral Postfiltering

Since the synthetic speech from sinusoidal coders sometimes exhibits a muffling effect, a postfilter can be applied to deepen the spectral valleys between formant frequencies [33]. Motivated by the time-domain LPC postfilter of [24], a frequency-domain postfilter design method has been developed [43]. For each frequency, these postfilter weights correspond to the unity-gain spectrally-flattened sinewave amplitudes raised to a power less than one for spectral compression. This postfiltering approach is used in both STC and MBE coders to improve speech quality.

3.3. Waveform Interpolation

For voiced speech, it is natural to think of the excitation signal as a sequence of pitch periods or cycles, with the waveform for each cycle representing the ideal excitation for the vocal tract filter. From one pitch period to the next, these excitation waveforms will be similar but not identical, since the speech content is gradually changing. Therefore, we can extract these pitch cycle excitation waveforms at a slower rate, quantize them for transmission, and reconstruct the missing waveforms at the receiver. This is the principle behind another successful low rate speech technique called waveform interpolation (WI) [48].

3.3.1. Analysis

The fundamental operation of WI is the extraction of the excitation pitch cycle waveform, called the characteristic waveform [49]. Each waveform is extracted


as one period of the signal, so minimizing edge effects is important for accurate representation. Therefore, this analysis is performed on the LPC residual signal; in addition, some local adjustment of the window position is allowed so as to minimize signal power near the boundaries. To ensure that the characteristic waveforms evolve smoothly in time, an alignment process is performed, in which each extracted waveform is circularly shifted to maximize its correlation with the previous one. The result of this alignment process is a sequence of time-aligned characteristic waveforms, each with similar shape. 3.3.2. Synthesis Synthesis in WI coders is typically done using harmonic sine wave synthesis of the excitation signal followed by LPC synthesis. The fundamental frequency and the complex sine wave coefficients are interpolated for every sample, while the LPC coefficients are interpolated on a subframe basis. However, it is sometimes more convenient to perform the LPC synthesis filtering in the frequency domain prior to the sine wave synthesis [48]. Note that since the alignment phase (the circular time shift of each characteristic waveform) is not preserved, the precise time offset from the beginning to the end of a frame of synthesized speech is not explicitly controlled and will depend upon the interpolated pitch contour for the frame. Therefore, WI coders are typically not time-synchronized with the input speech signal; instead they exhibit a slow time drifting which is not perceptually relevant. 3.3.3. Low Rate WI Speech Models At low bit rates, WI coders also exploit the linear speech model. Besides the concepts of pitch and vocal tract filtering mentioned in the basic WI formulation, additional bit savings can be introduced by using a soft voicing concept and by using a parametric model for system phase. In WI, voicing is represented by the decomposition of the characteristic waveform into two additive components: the slowly evolving waveform (SEW) and the rapidly evolving waveform (REW) [50]. The underlying idea is that the quasi-periodic voiced


component changes slowly over time, while the unvoiced noisy component changes quickly. Analysis of the SEW/REW decomposition is done by lowpass and highpass filtering the aligned characteristic waveforms over time, typically with a cutoff frequency of around 20 Hz. The SEW can then be carefully quantized at an update rate of 40 Hz, while the REW can be replaced by a noise signal with similar spectral characteristics (typically by randomizing the phase). Since the full SEW waveform contains a great deal of information, at low rates it is typically quantized by coding only the lower harmonic amplitudes, setting the other amplitudes to unity, and setting the phases to zero so that the LPC filter will provide an overall minimum phase response. For the REW, only a crude amplitude envelope is necessary, but it should be updated quite frequently.

3.3.4. High-Rate Improvements

More recent work has improved the performance of WI modeling at higher bit rates. A WI method with asymptotically perfect reconstruction properties was presented in [51], enabling very high quality output from the WI analysis/synthesis system. Similarly, a scalable WI coder using pitch-synchronous wavelets allows a wide range of bit rates to be utilized [52]. In this coder, higher bit rates achieve better speech quality by encoding higher resolutions of the wavelet decomposition.

3.4. Comparison and Contrast of Modeling Approaches

All of the low rate coding models described in this section can be viewed as representing the linear speech model of Figure 5, where the parameters for transmission for each frame of speech include the pitch period (or frequency), gain, voicing information, and vocal tract filter. Common elements of most MELP, sinusoidal, and WI coders are:

• vocal tract filter modeling with LPC
• pitch extraction algorithm for excitation analysis
• incorporation of excitation Fourier amplitude spectrum


• zero-phase excitation
• soft voicing decision allowing mixed excitation synthesis
• spectral enhancement filtering to sharpen formant frequencies.

However, each coder has its own strengths since it is based on a different perspective of the speech signal. The MELP model extends the traditional vocoder view of the excitation as a sequence of pulses, so it is able to exploit pitch pulse characteristics such as waveform shape. By contrast, sinusoidal modeling is based on a Fourier transform view of the speech signal, allowing straightforward frame-based processing. Finally, WI views the excitation signal as an evolving Fourier series expansion, allowing smooth analysis and synthesis of a signal with changing characteristics including pitch.

4. EFFICIENT QUANTIZATION OF MODEL PARAMETERS

For low rate speech coding, a flexible parametric model provides a necessary first step, but efficient quantization is critical to an overall coder design. In this section, we review two very important quantization topics: vector quantization and LPC coefficient quantization.


4.1. Vector Quantization

The traditional approach to quantization of speech parameters, such as harmonic amplitudes for example, is to encode each one separately. In contrast to this scalar quantization approach, significant performance improvement can be attained by encoding multiple values simultaneously, by searching a codebook of possible vectors. Since both encoder and decoder have copies of the same codebook, the only information to be transmitted is the index of the selected codevector. This process, called vector quantization (VQ) [53], has a long history in low rate speech coding. The earliest applications, in pattern-matching vocoders of the 1950's, were based on tables designed for specific speech phonemes [54]. Starting around the same time, a more rigorous information-theoretic approach was developing based on Shannon's work on block source coding and rate-distortion theory [55].

Since the complexity of vector quantization can become prohibitive when the number of codevectors is large, specially-structured codebooks are often used. Theoretically, such approaches will lose coding efficiency as compared to a full VQ, but they can have significant advantages in codebook storage size and search complexity. In one common approach, called split VQ, the vector is divided into subvectors, and each of these is vector quantized. As the number of subvectors becomes large, split VQ becomes scalar quantization, so this approach provides a straightforward way to adjust coding performance vs. complexity. Another approach, multistage VQ (MSVQ), maintains full dimensionality but reconstructs a quantized vector as the sum of codevectors from a number of different codebooks, each of smaller size. A simple MSVQ search algorithm, called sequential search, processes the codebooks one stage at a time. The input vector is quantized using the first codebook, then the error from this first stage is quantized using the second codebook, and this process continues until all of the codebooks have been used. However, much better performance is typically achieved with a joint search MSVQ, where the best joint set of codevectors over all stages is selected.
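A minimal sketch of the sequential MSVQ search just described, with toy codebooks, is shown below; a joint M-best search over the stages would generally perform better, as the text notes.

```python
import numpy as np

def msvq_encode_sequential(x, codebooks):
    """Sequential multistage VQ: quantize the input with the first codebook,
    then quantize the remaining error with each later stage."""
    indices, residual = [], np.asarray(x, dtype=float)
    for cb in codebooks:
        dists = np.sum((cb - residual) ** 2, axis=1)   # squared error to each codevector
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]                  # pass the error to the next stage
    return indices

def msvq_decode(indices, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Toy example: two 4-entry stages quantizing a 2-D vector.
rng = np.random.default_rng(0)
stages = [rng.standard_normal((4, 2)), 0.3 * rng.standard_normal((4, 2))]
idx = msvq_encode_sequential([0.7, -0.2], stages)
x_quantized = msvq_decode(idx, stages)
```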

4.2. Exploiting Temporal Properties

Further coding efficiency can be attained by exploiting the slowly-varying characteristics of speech over time. One simple approach uses predictive VQ, where the parameter values for a current frame are first predicted from past quantized values, and only the error between this predicted value and the true input is quantized. Typically a linear combination of past values is used, resulting in vector linear prediction equations similar to the traditional linear prediction of (1), but with prediction across frames rather than samples. These prediction coefficients can be designed off-line based on training data; alternatively in switched predictive VQ the prediction coefficients can be adjusted and transmitted for each frame based on the speech characteristics or to optimize quantizer performance. By using simple models of speech evolution, predictive quantization can provide significant


bit-rate reduction without introducing additional delay. More general coding schemes quantize a sequence of speech frames together. The advantage of joint encoding is that redundancies between neighboring frames can be fully exploited, but the disadvantage can be a significant increase in delay. A fixed block of parameter vectors, often called a superframe, can be encoded simultaneously. The spectral vectors can be quantized directly using matrix quantization, where each codebook entry corresponds to a sequence of vectors. Alternatively, adaptive interpolation can be used to transmit only a subset of frames with each block, replacing non-transmitted frames using interpolation. More flexibility can be introduced by replacing fixed blocks with variable-length segments. In particular, in a phonetic vocoder, each segment corresponds to a speech phoneme [56]. In this case, the only information transmitted is what was said (the phonetic content) and the way it was said (prosody).
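As a sketch of the predictive VQ idea introduced at the start of this subsection, the function below predicts the current frame from the previous quantized frame and quantizes only the residual; the single scalar prediction coefficient and the toy residual codebook are assumptions.

```python
import numpy as np

def predictive_vq_encode(x, prev_quantized, residual_codebook, pred_coeff=0.7):
    """Predictive VQ of one parameter vector: quantize the prediction error,
    then reconstruct.  The decoder repeats the same prediction from its own
    quantized history, so encoder and decoder stay in step."""
    prediction = pred_coeff * np.asarray(prev_quantized, dtype=float)
    error = np.asarray(x, dtype=float) - prediction
    dists = np.sum((residual_codebook - error) ** 2, axis=1)
    idx = int(np.argmin(dists))
    x_quantized = prediction + residual_codebook[idx]
    return idx, x_quantized
```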

4.3. LPC Filter Quantization In a low-rate coder utilizing linear prediction, the LPC filter coefficients must be quantized. It is well known that the LPC coefficients are not suitable for scalar quantization because they require a large number of bits to maintain an accurate representation and the quantized filter tends to be unstable. For this reason, early quantization work used alternative representations such as the reflection coefficients [3, 57]; however, almost all recent work on this topic uses the more powerful line spectral frequency (LSF) representation [58]. Originally developed as a speech synthesis structure, the LSF’s are generated by finding the roots of the two z-transform polynomials corresponding to the LPC coefficients with additional reflection coefficients of 1 or −1. The LSF’s have many useful properties for both quantization and synthesizer interpolation. They have a natural ordering, and a valid set of LSF’s guarantees a stable LPC filter. In addition, the frequency representation allows quantization to take advantage of known properties of human perception. The higher frequencies can be quantized less accurately than lower frequencies, and LSF’s corresponding to sharp LPC poles at the perceptually important


speech formants can be selected for more accurate representation.

In modern low-rate coders, spectral quantization is typically done with some form of vector quantization of the LSF's. A common performance measure is spectral distortion (SD) between unquantized and quantized LPC spectra, with a performance goal of average distortion of 1 dB while minimizing outlier frames [59]. However, there is evidence that a critical-band weighted version of SD is a more accurate predictor of perceived distortion [60]. Typically, more than 20 bits are required per frame to achieve this performance, and the complexity of full VQ with 2^20 codewords becomes too high for reasonable applications. Therefore, most low-rate coders use either split [59] or multistage [61] VQ. Finally, to implement VQ in the LSF domain, a weighted Euclidean LSF distance measure is needed. Commonly used functions weight LSF's in the vicinity of formants more strongly based on the LPC power spectrum [59], but the optimal LSF weighting to optimize SD performance has been derived [62]. This has also been modified to incorporate critical-band weighting [63].

5. LOW-RATE SPEECH CODING STANDARDS

In this section, we present a number of successful low-rate coder designs that have been standardized for communication applications. First, we describe low-rate coders that have been standardized for military communication applications: the U. S. MIL-STD 3005 2.4 kb/s MELP and the NATO STANAG 4591 MELPe coding suite at 2400, 1200, and 600 b/s. Then, we present an overview of speech coding standards for satellite telephony based on MBE. Finally, we discuss more recent efforts by the ITU to standardize a toll-quality low-rate coder at 4 kb/s.

5.1. MIL-STD 3005

In early 1993, the U. S. Government DoD Digital Voice Processing Consortium initiated a three-year project to select a new 2.4 kb/s coder to replace the older LPC-10e standard [64] for secure communications. This selection process consisted of two parts: evaluation of six minimum performance


requirements and overall composite rating by figure of merit [65]. Intelligibility, quality, speaker recognizability, communicability, and complexity were all measured as part of this selection process. This project led to the selection in 1996 of a 2.4 kb/s MELP coder as the new MIL-STD 3005, as it had the highest figure of merit among the coders that passed all requirements. This coder uses a 22.5 ms speech frame, with the bit allocation shown in Table 1 [66, 67]. The spectral envelope is represented by a 10th-order LPC filter, quantized using MSVQ of the LSF's with four stages and an M-best search algorithm [61]. Additional spectral modeling is achieved using the first 10 residual Fourier harmonic magnitudes, which are estimated from the spectral peaks of a frame-based DFT and quantized with an 8-bit VQ. The gain is estimated twice per frame, and quantized with a joint 8-bit coding scheme. The logarithm of the pitch period is quantized with a scalar quantizer spanning lags from 20 to 160 samples, with a reserved all-zero code for unvoiced frames. Of the five bandpass voicing decisions, the lowest one is represented by the overall voiced/unvoiced decision, and the remaining four are encoded with one bit each. In intelligibility, quality, speaker recognizability, and communicability, performance of the 2.4 kb/s MELP standard was much better than the earlier 2.4 kb/s LPC-10e [68]. This coder also met the project goal of performance at least as good as the older 4.8 kb/s FS1016 CELP standard [69], at only half the bit rate. Interestingly, despite their differences in design, three other coders, based on STC, WI, and MBE, provided similar performance, with the first two of these also able to meet all six performance requirements.
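As a quick check of the frame arithmetic, the sketch below adds up the field sizes stated in this chapter for the 2.4 kb/s coder; how the remaining bits split among the pitch/voicing code, the aperiodic flag, and synchronization is an assumption here, not a quotation of the standard.

```python
frame_ms = 22.5
bits_per_frame = round(2400 * frame_ms / 1000)   # 54 bits per 22.5 ms frame
stated_fields = {
    "LSF MSVQ (4 stages)": 25,        # stated later with the 1.2 kb/s description
    "Fourier magnitudes VQ": 8,
    "gain (2 values, joint)": 8,
    "bandpass voicing": 4,
}
remaining = bits_per_frame - sum(stated_fields.values())
print(bits_per_frame, remaining)      # 54 total, 9 left for pitch/voicing, aperiodic flag, sync
```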

5.2. The NATO STANAG 4591

These impressive results led the North Atlantic Treaty Organization (NATO) to undertake a similar standardization process in 1997, this time targeting the dual bit rates of 2.4 kb/s and 1.2 kb/s. Candidates from three NATO nations were evaluated using tests of speech quality, intelligibility, speaker recognition, and language dependency [70, 71, 72]. An enhanced MELP coder, called MELPe, was then selected as the 2.4/1.2 kb/s NATO standard STANAG 4591 in 2001 [73].


This 2.4 kb/s MELPe coder is based on MILSTD 3005 MELP, using the same quantization techniques and bit allocation, and is therefore completely interoperable with it. However, performance has been improved, primarily by the addition of a robust noise suppression front-end based on minimum mean-square error of the log spectral amplitudes [74]. This 2.4 kb/s MELPe coder was extended to also operate at 1.2 kb/s, using a superframe of three consecutive 22.5 ms frames [75]. This approach allows the 1.2 kb/s MELPe coder to perform nearly as well as the 2.4 kb/s version, at the price of additional coding delay. In 1.2 kb/s MELPe, the pitch trajectory and voicing pattern are jointly vector quantized using 12 bits for each superframe. The LSF quantization depends upon the voicing pattern, but for the most challenging case of all voiced frames a forward/backward interpolation scheme is used. This algorithm has three components: the last frame is quantized with the single-frame 25-bit MSVQ from the 2.4 kb/s coder, the optimal interpolation patterns for the first two frames are vector quantized with four bits, and finally the remaining error of these two frames is jointly quantized with 14-bit MSVQ. The Fourier magnitudes for the last frame are quantized with the same 8-bit VQ as in 2.4 kb/s, and the remaining frame magnitudes are regenerated using interpolation. The six gain values are vector quantized with 10 bits, and the four bandpass voicing decisions for each voiced frame are quantized with a two-bit codebook. The aperiodic flag is quantized with one bit per superframe, with codebooks selected by the overall voicing pattern. In 2006, the NATO STANAG 4591 was extended to also operate at 600 b/s [76]. As with the 1.2 kb/s version, this coder is based on the 2.4 kb/s coder with a longer analysis superframe. In this case, four consecutive frames are grouped into a 90 ms superframe. For each superframe, the overall voicing and bandpass voicing decisions are first jointly vector quantized with five bits. The remaining quantization algorithms are mode-dependent, based on this voicing pattern. The pitch trajectory is quantized using up to 8 bits, using scalar quantization with adaptive interpolation. The LSF’s are quantized using a matrix extension of MSVQ, with each matrix representing the concatenation of two consecutive input vectors. Fi-


nally, the eight gain values in a superframe are jointly quantized using either full VQ or MSVQ, depending upon the mode. The aperiodic flag and Fourier magnitudes are not used at this low rate. Overall, the NATO STANAG 4591 MELPe coder family provides communication-quality speech at bit rates of 2.4 kb/s, 1.2 kb/s, and 600 b/s. MELPe performance at 2.4 kb/s is better than 4.8 kb/s FS1016, and there is a graceful degradation in performance with decreasing bit rate. Even at 600 b/s, MELPe performance is still better than 2.4 kb/s LPC-10e [76]. 5.3. Satellite Communications There are a number of commercial satellite communication systems for use in situations where cellular telephony may be impractical, such as for maritime applications. These include the Inmarsat system, as well as other satellite communication systems such as ICO, Iridium, ACeS, Optus, and AMSC-TMI. While the published literature on speech coding standards in such systems is limited, many use versions of MBE tailored for bit rates around 4 kb/s, for example the IMBE coder described in [77] or the more recent proprietary AMBE coder. These systems are designed to provide communications quality in potentially high bit error rates. According to one assessment [78], the AMBE coder intended for use in the Inmarsat Aeronautical system provides speech quality approximately equivalent to first-generation digital cellular (specifically the U.S. TDMA Full-rate standard VSELP algorithm [79]). In addition to these satellite communication systems, a similar technology has been standardized for North American digital land mobile radio systems by APCO-NASTD-Fed Project 25.


5.4. ITU 4 kb/s Standardization

We conclude this section with a discussion of an effort by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) to standardize a toll-quality 4 kb/s speech coder for a wide range of communication applications. The ambitious goals of this project were announced in 1994 [80]. Unfortunately, achieving these goals has proven more difficult than initially anticipated.

In parametric low-rate coders, 4 kb/s improvements have focused on better representation of the current model parameters, i.e., faster frame updates, more accurate modeling of spectral amplitudes, and use of complex Fourier coefficients (or, equivalently, waveform phase information), as in [81, 82, 83, 84]. Other approaches to the 4 kb/s problem abandoned the parametric model entirely and instead started from higher-rate waveform coders based on CELP, working to lower their bit rate. Key to these approaches is the idea of relaxing the waveform-matching constraint using the generalized analysis-by-synthesis coding paradigm, also known as RCELP [85]: significant benefits are achieved by allowing the CELP synthesis to drift in time relative to the input signal according to the best pitch prediction path. This approach was taken in one of the most promising ITU 4 kb/s candidates [86]. A third alternative is a hybrid coder, using both waveform and parametric coding at different times [87, 23, 49, 88]. Since parametric coders are good at producing periodic voiced speech and waveform coders are good at modeling transitional frames, a hybrid coder uses each where it is best suited and switches between the two as the character of the speech changes. A hybrid MELP/CELP coder was the basis for the other top-performing candidate [89].

After almost a decade of effort on the ITU 4 kb/s standardization, it is clear that, while multiple low-rate coding approaches can sometimes produce very high speech quality in formal evaluations, it is difficult to achieve toll-quality performance in all conditions. The top candidates were able to improve their performance in multiple rounds of testing, but none were able to reliably achieve toll quality in all testing laboratories and languages, and as a result no ITU 4 kb/s standard has yet been named.
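The following sketch illustrates the relaxed-matching idea behind RCELP mentioned above: for each subframe, the coding target is allowed to shift slightly in time so that it aligns with the excitation predicted at the pitch lag, and only then is the usual CELP error minimization applied. The function names are hypothetical, a circular shift stands in for proper time warping, and a real RCELP coder operates on a perceptually weighted signal along a continuously interpolated pitch contour:

import numpy as np

def best_target_shift(target, pitch_prediction, max_shift=8):
    # Search a small range of time shifts for the one that best aligns the
    # subframe target with the excitation predicted at the pitch lag,
    # measured by plain cross-correlation.
    shifts = list(range(-max_shift, max_shift + 1))
    scores = [np.dot(np.roll(target, s), pitch_prediction) for s in shifts]
    return shifts[int(np.argmax(scores))]

def rcelp_subframe_target(target, pitch_prediction, max_shift=8):
    # Shift the target before the usual CELP codebook searches, so the coder
    # is rewarded for matching a slightly time-warped version of the input
    # rather than the input itself.
    shift = best_target_shift(target, pitch_prediction, max_shift)
    return np.roll(target, shift), shift

The intent is that small, perceptually insignificant timing differences no longer have to be corrected by the fixed codebook, leaving more bits for perceptually important detail.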

6. SUMMARY

Low-rate speech coding can now provide reliable communications-quality speech at bit rates well below 4 kb/s. While there are a number of different approaches to achieving these rates, such as MELP, STC, MBE, and WI, all rely on flexible parametric models combined with sophisticated quantization techniques to achieve this performance. Successful standardization of low-rate coders has enabled their widespread use for military and satellite communications. However, the goal of toll-quality low-rate coding continues to provide a research challenge.

7. REFERENCES

[1] M. R. Schroeder and B. S. Atal, "Code excited linear prediction (CELP): High quality speech at very low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tampa, 1985, pp. 937–940.
[2] G. Fant, Acoustic Theory of Speech Production, Mouton, The Hague, 1960.
[3] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.
[4] L. E. Kinsler et al., Fundamentals of Acoustics, John Wiley and Sons, New York, third edition, 1982.
[5] B. Scharf, "Critical bands," in Foundations of Modern Auditory Theory, Jerry V. Tobias, Ed., chapter five. Academic Press, New York, 1970.
[6] D. O. Kim et al., "Responses of cochlear nucleus neurons to speech signals: Neural encoding of pitch, intensity, and other parameters," in Auditory Frequency Selectivity, B. C. J. Moore and R. D. Patterson, Eds., pp. 281–288. Plenum Press, New York, 1986.
[7] H. Dudley, "Remaking speech," J. Acoust. Soc. Amer., vol. 11, pp. 169–177, 1939.
[8] J. N. Holmes, "The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer," IEEE Trans. Audio and Electroacoustics, vol. 21, pp. 298–305, June 1973.
[9] D. H. Klatt, "Review of text-to-speech conversion for English," J. Acoust. Soc. Amer., vol. 82, pp. 737–793, Sept. 1987.
[10] F. Itakura and S. Saito, "Analysis synthesis telephony based on the maximum likelihood method," in Rep. 6th Int. Congr. Acoustics, Aug. 1968, pp. C17–C20.
[11] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Amer., vol. 50, no. 2, pp. 637–655, Aug. 1971.
[12] O. Fujimura, "An approximation to voice aperiodicity," IEEE Trans. Audio and Electroacoustics, vol. 16, pp. 68–72, Mar. 1968.
[13] J. Makhoul, R. Viswanathan, R. Schwartz, and A. W. F. Huggins, "A mixed-source model for speech compression and synthesis," J. Acoust. Soc. Amer., vol. 64, no. 6, pp. 1577–1581, Dec. 1978.
[14] S. Y. Kwon and A. J. Goldberg, "An enhanced LPC vocoder with no voiced/unvoiced switch," IEEE Trans. Acoust., Speech, Signal Processing, vol. 32, pp. 851–858, Aug. 1984.
[15] G. S. Kang and S. S. Everett, "Improvement of the excitation source in the narrow-band linear prediction vocoder," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, pp. 377–386, Apr. 1985.
[16] M. R. Sambur, A. E. Rosenberg, L. R. Rabiner, and C. A. McGonegal, "On reducing the buzz in LPC synthesis," J. Acoust. Soc. Amer., vol. 63, pp. 918–924, Mar. 1978.
[17] D. Y. Wong, "On understanding the quality problems of LPC speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1980, pp. 725–728.
[18] B. S. Atal and N. David, "On synthesizing natural-sounding speech by linear prediction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1979, pp. 44–47.
[19] A. McCree and T. P. Barnwell III, "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Trans. Speech and Audio Processing, vol. 3, no. 4, pp. 242–250, July 1995.
[20] A. McCree and T. P. Barnwell III, "Improving the performance of a mixed excitation LPC vocoder in acoustic noise," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, San Francisco, 1992, pp. II-137–II-140.
[21] W. Hess, Pitch Determination of Speech Signals, Springer, 1983.
[22] A. McCree and T. P. Barnwell III, "A new mixed excitation LPC vocoder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Toronto, 1991, pp. 593–596.
[23] D. L. Thomson and D. P. Prezas, "Selective modeling of the LPC residual during unvoiced frames: White noise or pulse excitation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, 1986, pp. 3087–3090.
[24] J. H. Chen and A. Gersho, "Real-time vector APC speech coding at 4800 bps with adaptive postfiltering," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Dallas, 1987, pp. 2185–2188.
[25] W. B. Kleijn, D. J. Krasinski, and R. H. Ketchum, "Fast methods for the CELP speech coding algorithm," IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, no. 8, pp. 1330–1342, Aug. 1990.
[26] J. N. Holmes, "Formant excitation before and after glottal closure," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1976, pp. 39–42.
[27] A. E. Rosenberg, "Effect of glottal pulse shape on the quality of natural vowels," J. Acoust. Soc. Amer., vol. 49, pp. 583–590, 1971.
[28] A. McCree and J. C. DeMartin, "A 1.7 kb/s MELP coder with improved analysis and quantization," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Seattle, 1998, pp. II-593–596.
[29] T. Unno, T. P. Barnwell III, and K. Truong, "An improved mixed excitation linear prediction (MELP) coder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1999, pp. 245–248.
[30] W. Lin, S. N. Koh, and X. Lin, "Mixed excitation linear prediction coding of wideband speech at 8 kbps," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2000, pp. II-1137–II-1140.
[31] N. R. Chong-White and R. V. Cox, "An intelligibility enhancement for the mixed excitation linear prediction speech coder," IEEE Signal Processing Letters, vol. 10, no. 9, pp. 263–266, Sept. 2003.
[32] A. E. Ertan and T. P. Barnwell III, "Improving the 2.4 kb/s military standard MELP (MS-MELP) coder using pitch-synchronous analysis and synthesis techniques," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2005, pp. 761–764.
[33] R. J. McAulay and T. F. Quatieri, "Sinusoidal coding," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., chapter 4. Elsevier, 1995.
[34] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, chapter 9, Prentice Hall, 2002.
[35] P. Hedelin, "A tone-oriented voice-excited vocoder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1981, pp. 205–208.
[36] L. B. Almeida and F. M. Silva, "Variable-frequency synthesis: An improved harmonic coding scheme," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1984, pp. 27.5.1–27.5.4.
[37] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, no. 4, pp. 744–754, Aug. 1986.
[38] D. W. Griffin and J. S. Lim, "Multiband excitation vocoder," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, no. 8, pp. 1223–1235, Aug. 1988.
[39] E. B. George and M. J. T. Smith, "Speech analysis/synthesis and modification using an analysis-by-synthesis/overlap-add sinusoidal model," IEEE Trans. Speech and Audio Processing, vol. 5, no. 5, pp. 389–406, Sept. 1997.
[40] C. Li and V. Cuperman, "Analysis-by-synthesis multimode harmonic speech coding at 4 kb/s," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2000, vol. 3, pp. 1367–1370.
[41] C. O. Etemoglu, V. Cuperman, and A. Gersho, "Speech coding with an analysis-by-synthesis sinusoidal model," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2000, vol. 3, pp. 1371–1374.
[42] M. S. Brandstein, "A 1.5 kbps multi-band excitation speech coder," M.S. thesis, Massachusetts Institute of Technology, May 1990.
[43] R. McAulay, T. Parks, T. Quatieri, and M. Sabin, "Sine-wave amplitude coding at low data rates," in Advances in Speech Coding, pp. 203–214. Kluwer Academic Publishers, Norwell, MA, 1991.
[44] S. Yeldener, A. M. Kondoz, and B. G. Evans, "High quality multiband LPC coding of speech at 2.4 kbit/s," Electronics Letters, vol. 27, no. 14, pp. 1287–1289, July 1991.
[45] M. Nishiguchi, J. Matsumoto, R. Wakatsuki, and S. Ono, "Vector quantized MBE with simplified V/UV division at 3.0 kbit/s," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1993, vol. 2, pp. 151–154.
[46] A. Das, A. V. Rao, and A. Gersho, "Variable-dimension vector quantization," IEEE Signal Processing Letters, vol. 3, no. 7, pp. 200–202, July 1996.
[47] P. Lupini and V. Cuperman, "Nonsquare transform vector quantization," IEEE Signal Processing Letters, vol. 3, no. 1, pp. 1–3, Jan. 1996.
[48] W. B. Kleijn and J. Haagen, "Waveform interpolation for coding and synthesis," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., chapter 5. Elsevier, 1995.
[49] W. B. Kleijn, "Encoding speech using prototype waveforms," IEEE Trans. Speech and Audio Processing, vol. 1, no. 4, pp. 386–399, Oct. 1993.
[50] W. B. Kleijn and J. Haagen, "Transformation and decomposition of the speech signal for coding," IEEE Signal Processing Letters, vol. 1, pp. 136–138, Sept. 1994.
[51] T. Eriksson and W. B. Kleijn, "On waveform-interpolation coding with asymptotically perfect reconstruction," in Proc. IEEE Workshop on Speech Coding, 1999, pp. 93–95.
[52] N. R. Chong, I. S. Burnett, and J. F. Chicharo, "A new waveform interpolation coding scheme based on pitch synchronous wavelet transform decomposition," IEEE Trans. Speech and Audio Processing, vol. 8, no. 3, pp. 345–348, May 2000.
[53] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, 1992.
[54] H. Dudley, "Phonetic pattern recognition vocoder for narrow-band speech transmission," J. Acoust. Soc. Amer., vol. 30, pp. 733–739, 1958.
[55] C. E. Shannon, "A mathematical theory of communication," Bell System Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
[56] J. Picone and G. Doddington, "A phonetic vocoder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1989, pp. 580–583.
[57] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, vol. 63, pp. 561–579, Apr. 1975.
[58] F. Itakura, "Line spectrum representation of linear predictive coefficients of speech signals," J. Acoust. Soc. Amer., vol. 57, p. S35(A), 1975.
[59] K. K. Paliwal and B. S. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame," IEEE Trans. Speech and Audio Processing, vol. 1, no. 1, pp. 3–14, Jan. 1993.
[60] J. S. Collura, A. McCree, and T. E. Tremain, "Perceptually based distortion measures for spectrum quantization," in Proc. IEEE Workshop on Speech Coding for Telecommunications, 1995, pp. 49–50.
[61] W. P. LeBlanc, B. Bhattacharya, S. A. Mahmoud, and V. Cuperman, "Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding," IEEE Trans. Speech and Audio Processing, vol. 1, no. 4, pp. 373–385, Oct. 1993.
[62] W. Gardner and B. Rao, "Theoretical analysis of the high-rate vector quantization of LPC parameters," IEEE Trans. Speech and Audio Processing, vol. 3, pp. 367–381, Sept. 1995.
[63] A. McCree and J. C. DeMartin, "A 1.6 kb/s MELP coder for wireless communications," in Proc. IEEE Workshop on Speech Coding for Telecommunications, 1997, pp. 23–24.
[64] T. E. Tremain, "The government standard linear predictive coding algorithm: LPC-10," Speech Technology, pp. 40–49, Apr. 1982.
[65] T. E. Tremain, M. A. Kohler, and T. G. Champion, "Philosophy and goals of the DoD 2400 bps vocoder selection process," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1996, vol. 2, pp. 1137–1140.
[66] A. McCree, K. Truong, E. B. George, T. P. Barnwell III, and V. R. Viswanathan, "A 2.4 kbit/s MELP coder candidate for the new U.S. Federal Standard," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1996, vol. 1, pp. 200–203.
[67] L. M. Supplee, R. P. Cohn, J. S. Collura, and A. McCree, "MELP: the new Federal Standard at 2400 bps," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1997, vol. 2, pp. 1591–1594.
[68] M. A. Kohler, "A comparison of the new 2400 bps MELP federal standard with other standard coders," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1997, pp. 1587–1590.
[69] J. P. Campbell Jr., T. E. Tremain, and V. C. Welch, "The DoD 4.8 kbps Standard (Proposed Federal Standard 1016)," in Advances in Speech Coding, pp. 121–133. Kluwer Academic Publishers, Norwell, MA, 1991.
[70] S. Villette, K. T. Al Naimi, C. Sturt, A. M. Kondoz, and H. Palaz, "A 2.4/1.2 kbps SB-LPC based speech coder: the Turkish NATO STANAG candidate," in Proc. IEEE Workshop on Speech Coding, 2002, pp. 87–89.
[71] G. Guilmin, P. Gournay, and F. Chartier, "Description of the French NATO candidate," in Proc. IEEE Workshop on Speech Coding, 2002, pp. 84–86.
[72] T. Wang, K. Koishida, V. Cuperman, A. Gersho, and J. S. Collura, "A 1200/2400 bps coding suite based on MELP," in Proc. IEEE Workshop on Speech Coding, Tsukuba, 2002, pp. 90–92.
[73] J. S. Collura, D. F. Brandt, and D. J. Rahikka, "The 1.2 kbps/2.4 kbps MELP speech coding suite with integrated noise pre-processing," in IEEE Military Communications Conference Proceedings, 1999, vol. 2, pp. 1449–1453.
[74] R. Martin and R. V. Cox, "New speech enhancement techniques for low bit rate speech coding," in Proc. IEEE Workshop on Speech Coding, 1999, pp. 165–167.
[75] T. Wang, K. Koishida, V. Cuperman, A. Gersho, and J. S. Collura, "A 1200 bps speech coder based on MELP," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2000, pp. 1375–1378.
[76] G. Guilmin, F. Capman, B. Ravera, and F. Chartier, "New NATO STANAG narrow band voice coder at 600 bits/s," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2006, pp. 689–693.
[77] J. C. Hardwick and J. S. Lim, "The application of the IMBE speech coder to mobile communications," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1991, vol. 1, pp. 249–252.
[78] S. F. Campos Neto, F. L. Corcoran, J. Phipps, and S. Dimolitsas, "Performance assessment of 4.8 kbit/s AMBE coding under aeronautical environmental conditions," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1996, vol. 1, pp. 499–502.
[79] I. A. Gerson and M. A. Jasiuk, "Vector sum excited linear prediction (VSELP)," in Advances in Speech Coding, B. S. Atal, V. Cuperman, and A. Gersho, Eds., pp. 69–79. Kluwer Academic Publishers, 1991.
[80] S. Dimolitsas, C. Ravishankar, and G. Schroder, "Current objectives in 4-kb/s wireline-quality speech coding standardization," IEEE Signal Processing Letters, vol. 1, no. 11, pp. 157–159, Nov. 1994.
[81] E. L. T. Choy, "Waveform interpolation speech coder at 4 kb/s," M.S. thesis, McGill University, 1998.
[82] O. Gottesman and A. Gersho, "Enhanced waveform interpolative coding at low bit-rate," IEEE Trans. Speech and Audio Processing, vol. 9, no. 8, pp. 786–798, Nov. 2001.
[83] J. Stachurski, A. McCree, and V. Viswanathan, "High quality MELP coding at bit-rates around 4 kb/s," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Phoenix, 1999, pp. 485–488.
[84] S. Yeldener, "A 4 kb/s toll quality harmonic excitation linear predictive speech coder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1999, vol. 1, pp. 481–484.
[85] W. B. Kleijn, R. P. Ramachandran, and P. Kroon, "Generalized analysis-by-synthesis coding and its application to pitch prediction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1992, vol. 1, pp. 337–340.
[86] J. Thyssen, Y. Gao, A. Benyassine, E. Shlomot, C. Murgia, H. Su, K. Mano, Y. Hiwasaki, H. Ehara, K. Yasunaga, C. Lamblin, B. Kovesi, J. Stegmann, and H. Kang, "A candidate for the ITU-T 4 kbit/s speech coding standard," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2001, vol. 2, pp. 681–684.
[87] I. M. Trancoso, L. Almeida, and J. M. Tribolet, "A study on the relationships between stochastic and harmonic coding," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1986, pp. 1709–1712.
[88] E. Shlomot, V. Cuperman, and A. Gersho, "Combined harmonic and waveform coding of speech at low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Seattle, 1998, pp. 585–588.
[89] A. McCree, J. Stachurski, T. Unno, E. Ertan, E. Paksoy, V. Viswanathan, A. Heikkinen, A. Ramo, S. Himanen, P. Blocher, and O. Dressler, "A 4 kb/s hybrid MELP/CELP speech coding candidate for ITU standardization," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2002, vol. 1, pp. 629–632.


Parameters                   Voiced   Unvoiced
LSFs                         25       25
Fourier magnitudes           8        -
Gain (2 per frame)           8        8
Pitch and overall voicing    7        7
Bandpass voicing             4        -
Aperiodic flag               1        -
Error protection             -        13
Sync bit                     1        1
Total bits / 22.5 ms         54       54

Table 1: 2.4 kb/s MELP coder bit allocation.
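As a quick check of Table 1, both frame types carry 54 bits per 22.5 ms frame, which corresponds exactly to the 2.4 kb/s rate. The short snippet below just redoes this arithmetic (the dictionary keys are informal labels for the table rows):

# Per-frame bit allocations copied from Table 1.
voiced_bits = {"LSFs": 25, "Fourier magnitudes": 8, "gain": 8,
               "pitch and overall voicing": 7, "bandpass voicing": 4,
               "aperiodic flag": 1, "sync bit": 1}
unvoiced_bits = {"LSFs": 25, "gain": 8, "pitch and overall voicing": 7,
                 "error protection": 13, "sync bit": 1}

FRAME_SECONDS = 22.5e-3
for name, alloc in (("voiced", voiced_bits), ("unvoiced", unvoiced_bits)):
    total = sum(alloc.values())
    print(f"{name}: {total} bits -> {total / FRAME_SECONDS:.0f} b/s")
# Both frame types give 54 bits -> 2400 b/s.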


Figure 1: (a) Stylized glottal pulse g(n) versus time (in samples at 8 kHz) and (b) corresponding Fourier transform magnitude 20 log10 |G(e^{j2πFT})| in dB versus frequency (in kHz) (from Rabiner and Schafer, p. 104).


Figure 2: Sustained vowel /æ/: (a) waveform (amplitude versus time in samples at 8 kHz) and (b) Fourier spectrum (log magnitude in dB versus frequency in kHz).


Figure 3: Sustained fricative /sh/: (a) waveform (amplitude versus time in samples at 8 kHz) and (b) Fourier spectrum (log magnitude in dB versus frequency in kHz).


Figure 4: Plosive /t/: (a) waveform (amplitude versus time in samples at 8 kHz) and (b) Fourier spectrum (log magnitude in dB versus frequency in kHz).


Figure 5: Simple linear model of speech production (from Rabiner and Schafer, p. 105): an impulse train generator controlled by the pitch period drives a glottal pulse model G(z) for voiced speech, or a random noise generator supplies the excitation for unvoiced speech; the excitation selected by the voiced/unvoiced switch, scaled by Av or AN, passes through the vocal tract model V(z) and radiation model R(z) to produce the output pL(n).

Figure 6: Critical bands as a function of center frequency (critical bandwidth in Hz versus frequency in Hz).


Figure 7: Example bandpass filter output.

Figure 8: LPC vocoder synthesizer: a periodic pulse train or white noise excitation, scaled by the gain, drives the LPC synthesis filter to produce the synthesized speech.


Figure 9: MELP synthesizer: a periodic pulse train with position jitter (controlled by the aperiodic flag) and white noise are passed through pulse and noise shaping filters (controlled by the bandpass voicing strengths), mixed, and then processed by adaptive spectral enhancement, the LPC synthesis filter, the gain, and a pulse dispersion filter to produce the synthesized speech.


Figure 10: Natural speech vs. decaying resonance waveforms: (a) first formant of natural speech vowel, (b) synthetic exponentially decaying resonance, (c) pole/zero enhancement filter impulse response for this resonance, (d) enhanced decaying resonance.


Figure 11: Synthetic triangle pulse and FIR filter: (a) triangle waveform, (b) filter coefficients after spectral flattening with length 65 DFT, (c) Fourier transform (DTFT) after spectral flattening.

Index

adaptive spectral enhancement, 6
aperiodic pulse, 5
CELP, 1
glottal pulse, 2
human speech perception, 2
linear predictive coding (LPC), 4
low-bit-rate speech coding, 1
LPC filter quantization, 12
matrix quantization, 12
MBE, 7
MELP, 5
MELPe, 13
MIL-STD 3005, 12
mixed excitation, 5
NATO STANAG 4591, 13
parametric speech coder, 1
peakiness, 6
pitch period, 2
pulse dispersion filter, 6
sinusoidal coding, 7
speech production, 2
STC, 7
unvoiced speech, 2
vector quantization (VQ), 11
vocoder, 3
waveform interpolation, 9
