INTERSPEECH 2014

Formant Enhancement based Speech Watermarking for Tampering Detection

Shengbei Wang1, Masashi Unoki1, and Nam Soo Kim2

1 School of Information Science, Japan Advanced Institute of Science and Technology, Japan
2 School of Electrical Engineering and INMC, Seoul National University, Seoul, Republic of Korea


Abstract


Unauthorized tampering with speech signals causes serious problems when the originality and integrity of speech signals must be verified. Digital watermarking can effectively check whether original signals have been tampered with by embedding digital data into them. This paper proposes a tampering detection scheme for speech signals based on formant enhancement-based watermarking. Watermarks are embedded as a slight enhancement of a formant by symmetrically controlling the pair of line spectral frequencies (LSFs) of the corresponding formant. We evaluated the proposed scheme objectively against the three criteria required of a tampering detection scheme: (i) inaudibility to the human auditory system, (ii) robustness against meaningful processing, and (iii) fragility against tampering. The evaluation results showed that the proposed scheme provides satisfactory performance on all three criteria and is able to detect tampering in speech signals.

Index Terms: tampering detection, speech watermarking, formant enhancement, inaudibility, robustness, fragility


1. Introduction

Digital technologies have greatly facilitated the editing and duplication of speech signals. These technologies, however, have brought new social issues related to illegal editing and unauthorized tampering (intentional or unintentional) of speech signals. In addition, advanced speech analysis/synthesis methods such as STRAIGHT [1] and its applications, e.g., voice conversion [2], speech morphing [3], and singing-voice conversion [4], can produce tampered speech of high quality and intelligibility even though important information has been changed. All these advances have increased the possibility of tampering with speech signals. Speech is an important information carrier, so its originality should be strictly confirmed, especially in digital forensics, where speech may be recovered from digital devices and serve as evidence to support or refute a hypothesis.

Conventional cryptography-based methods can protect speech from tampering. However, such methods cannot detect tampering if a legitimate consumer edits or distributes the decrypted signal or if the decryption key is captured by illegal users. Watermarking methods do not suffer from this drawback, since they embed information (referred to as watermarks) directly into the original signal. Moreover, the embedded information persists in the signal and is difficult to remove.

To detect tampering effectively, a watermarking method should simultaneously satisfy three requirements: (i) inaudibility to the human auditory system (HAS), (ii) robustness against meaningful processing, and (iii) fragility against tampering. The first requirement ensures that the watermarks will not degrade the speech quality or affect its value in use. The second requirement guarantees that meaningful processing will not destroy the watermarks and thereby nullify them for tampering detection. The last requirement indicates that the watermarks will be destroyed and fail to be detected once the watermarked signal has been tampered with. Since the last two requirements guarantee that watermarks can only be destroyed by tampering, watermarking with both robustness and fragility can effectively identify tampering from destroyed watermarks.

Figure 1: Proposed scheme of tampering detection.

In previous studies, Celik et al. proposed a semi-fragile watermarking method that introduces small changes to the fundamental frequency [5]. Wu et al. implemented fragile speech watermarking for tampering detection based on odd/even modulation with exponential scale quantization [6]. Unoki and Hamada suggested a watermarking method based on the characteristics of cochlear delay (CD) [7], [8]. A method developed from [8] was presented in [9] for tampering detection. Other methods can be found in [10]-[14]. Since the requirements for watermarking usually conflict with each other, such as inaudibility versus robustness and robustness versus fragility, many existing watermarking methods, for example [6], [9], and [10], cannot always satisfy all requirements.

This paper aims to realize a tampering detection scheme based on speech watermarking that can satisfy all requirements. Our previous works separately showed that it is feasible to embed watermarks into line spectral frequencies (LSFs) [15] and that controlling a pair of LSFs as formant enhancement achieves good inaudibility and robustness for watermarking [16]. This paper investigates whether the previous method [16] is also fragile against tampering, so that it can be developed into a tampering detection scheme.

Copyright © 2014 ISCA


2. Scheme of tampering detection

During speech transmission, speech may be captured and tampered with by attackers. For example, voice conversion [2] can be used to tamper with the speech content (what the speaker is saying), e.g., replacing the word “Yes” with “No”; speech morphing [3] can be used to deliberately transform the individuality of the speaker (who is speaking) into that of another speaker. In digital forensics, where speaker identity plays a key role, any single changed word or forged speaker will cause serious problems for judgement. To detect whether tampering has occurred in speech transmission, this paper proposes a tampering detection scheme

14-18 September 2014, Singapore

[Figure 2: A formant produced by two adjacent LSFs. Legend: original LSFs ($\phi_l$, $\phi_r$) and shifted LSFs ($\phi_{lw}$, $\phi_{rw}$). Horizontal axis: frequency (Hz), marking $f_l$, $f_{lw}$, $f$, $f_{rw}$, and $f_r$, with the bandwidths $BW$ and $BW_{ew}$.]

[Figure 3: Block diagram of watermark embedding. Recoverable labels: original signal $x(n)$, LSFs, LSFs → LP coefficients, and LP synthesis.]

$$0 < \phi_1 < \phi_2 < \phi_3 < \cdots < \phi_p < \pi \qquad (1)$$

In general, each formant is produced by two adjacent LSFs, and the closer the two LSFs are, the sharper the formant is. Therefore, a formant can be enhanced by moving two LSFs closer together. Figure 2 shows a formant (dotted curve) produced by two adjacent LSFs, $\phi_l$ and $\phi_r$. Its sharpness can be measured by the tuning level, that is, the Q-value defined in (2), where $f$ is the center frequency and $BW$ is the bandwidth between $\phi_l$ and $\phi_r$ after they are converted to $f_l$ and $f_r$ in the frequency domain with (3), in which $F_s$ is the sampling frequency of the signal.

$$Q = \frac{f}{BW} = \frac{f}{f_r - f_l} \qquad (2)$$

$$f_r = \frac{\phi_r}{2\pi} \times F_s \quad \text{and} \quad f_l = \frac{\phi_l}{2\pi} \times F_s \qquad (3)$$

$$\phi_{lw} = \phi_l + \Delta \quad \text{and} \quad \phi_{rw} = \phi_r - \Delta, \qquad (4)$$

$$Q_{ew} = \frac{f}{BW_{ew}} = \frac{f}{f_{rw} - f_{lw}} \qquad (5)$$

where $f_{lw}$ and $f_{rw}$ are calculated as follows:

$$f_{rw} = \frac{\phi_{rw}}{2\pi} \times F_s \quad \text{and} \quad f_{lw} = \frac{\phi_{lw}}{2\pi} \times F_s \qquad (6)$$

Note that in the above formant enhancement method the two LSFs are shifted symmetrically, so the center frequency does not deviate, which preserves the original sound quality as far as possible.

II) Watermark embedding process

Watermarks can be embedded into the LSFs with the above formant enhancement method. According to Fig. 3, watermarks are embedded into the original signal as follows.

(i) The original signal, $x(n)$, is first segmented into non-overlapping frames. For each frame, LP analysis is applied to extract the LP coefficients and the LP residue. The LP coefficients are then converted to LSFs, which represent the formants in each frame.

(ii) Each frame is embedded with one watermark bit, ‘0’ or ‘1’ (according to $s(m)$). ‘0’ is embedded by enhancing the sharpest formant, while ‘1’ is embedded by enhancing the second sharpest formant. Figures 4(a) and 4(b) show the rules for embedding ‘0’ and ‘1’, respectively. Since a formant is sharper the closer its two LSFs are, the sharpest formant in Figs. 4(a) and 4(b) has the smallest bandwidth, $BW_{ab}$, and the second sharpest formant has the second smallest bandwidth, $BW_{cd}$.

A. Rule of embedding ‘0’: To embed ‘0’, the sharpest formant is enhanced $\Omega_{e0}$ ($\Omega_{e0} > 1$) times. Therefore, the original $BW_{ab}$ has to be reduced to $1/\Omega_{e0}$ of its value. As shown in Fig. 4(a), the original LSFs $\phi_a$ and $\phi_b$ are shifted to $\phi_{aw}$ and $\phi_{bw}$ as in (7).
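The relations in (2) to (6) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the center frequency is assumed to be the midpoint of the two converted LSFs, and the LSF values are arbitrary illustrative numbers. It shows that halving the LSF spacing with a symmetric shift doubles the Q-value and leaves the center frequency unchanged.

```python
import math

def lsf_to_hz(phi, fs):
    """Convert an LSF in radians to Hz, as in (3) and (6)."""
    return phi * fs / (2 * math.pi)

def q_value(phi_l, phi_r, fs):
    """Q-value of the formant between two adjacent LSFs, as in (2) and (5)."""
    f_l = lsf_to_hz(phi_l, fs)
    f_r = lsf_to_hz(phi_r, fs)
    center = (f_l + f_r) / 2.0  # assumption: center frequency is the midpoint
    return center / (f_r - f_l)

def shift_pair(phi_l, phi_r, delta):
    """Symmetric shift of (4): move both LSFs toward each other by delta."""
    if not 0 <= 2 * delta < (phi_r - phi_l):
        raise ValueError("delta must keep the two LSFs in order")
    return phi_l + delta, phi_r - delta

fs = 20000                   # sampling frequency used in the evaluations (Hz)
phi_l, phi_r = 0.30, 0.40    # illustrative LSF pair (rad)
phi_lw, phi_rw = shift_pair(phi_l, phi_r, 0.025)

q_before = q_value(phi_l, phi_r, fs)
q_after = q_value(phi_lw, phi_rw, fs)
print(f"Q before: {q_before:.2f}, Q after: {q_after:.2f}")
```

Because both LSFs move by the same amount in opposite directions, only the bandwidth term in the Q-value changes.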

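$$\hat{s}(m) = \begin{cases} 0, & bw_{cd} - bw_{ab} > bw_{ab} \times (\Omega_{e0} - 1)/2 \\ 1, & \text{otherwise} \end{cases} \qquad (12)$$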

2.2. Tampering detection process


To verify whether tampering has occurred to the watermarked signal before it is received at the receiver side, the detected watermarks, $\hat{s}(m)$, are compared with the embedded watermarks, $s(m)$. If there is no mismatch, no tampering has occurred to the received signal; otherwise, each mismatch indicates that the corresponding frame in the received signal may have been tampered with. For example, if $s(m)$ = 01001101... while $\hat{s}(m)$ = 01101101..., the third frame may have been tampered with.
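The comparison step can be sketched in a few lines. This is only an illustration with a hypothetical function name; it treats the watermarks as bit strings with one bit per frame and reports 1-based frame indices, matching the example above.

```python
def mismatched_frames(embedded, detected):
    """Return the 1-based indices of frames whose detected bit differs."""
    if len(embedded) != len(detected):
        raise ValueError("bit sequences must have the same length")
    return [
        index
        for index, (s, s_hat) in enumerate(zip(embedded, detected), start=1)
        if s != s_hat
    ]

suspect = mismatched_frames("01001101", "01101101")
print(suspect)  # frames that may have been tampered with
```

An empty list corresponds to the "no mismatch" case, in which the received signal is taken to be untampered.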

Figure 4: Concept of watermark embedding: (a) embed ‘0’ and (b) embed ‘1’.

3. Evaluations

We evaluated the proposed method with respect to inaudibility, robustness, and fragility. All 12 speech stimuli in the ATR database (B set) [20] (Japanese sentences uttered by six males and six females, 8.1 s, 20 kHz, 16 bits) were used. The frame size was 25 ms (40 frames per second). Every 10 consecutive frames were embedded with the same watermark bit, and the bit was detected by majority decision over those frames. Thus, the bit rate for embedding was 4 bps. The LP order was 10. $\Omega_{e0}$ for embedding ‘0’ was set to 2.0 to balance the conflicting requirements of inaudibility, robustness, and fragility ($\Omega_{e1}$ for ‘1’ was fixed automatically from the bandwidth characteristics of each frame). The embedded watermark was the single word “GOOD”. For comparison, two other methods were also evaluated: the least significant bit-replacement (LSB) method [10] and the cochlear delay (CD) method [8].

First, we evaluated the inaudibility and normal detection performance of the proposed method. The log-spectrum distortion (LSD) [21] and perceptual evaluation of speech quality (PESQ) [22] were used to check inaudibility. LSD in decibels (dB) measures the spectral distance between the original signal and the watermarked signal. An LSD of 1.0 dB was chosen as the criterion. PESQ in objective difference grades (ODGs), which range from −0.5 (very annoying) to 4.5 (imperceptible), was used to evaluate subjective quality; 3.0 (slightly annoying) was set as the criterion. Detection performance was checked by the correct bit detection rate (BDR). The criterion for BDR was 90%. The evaluation results are plotted in Fig. 6. All the methods satisfied the criteria for LSD, PESQ, and BDR; LSB and the proposed method performed especially well. These results indicate good performance of the proposed method in these evaluations.

Second, we evaluated the robustness of the proposed method. We applied three typical speech codecs to the watermarked signals: G.711 (pulse code modulation (PCM)), G.726 (adaptive differential PCM (ADPCM)), and G.729 (code-excited linear prediction (CELP)). Figure 7 plots the BDR results after these speech codecs. LSB was not robust against any speech codec; CD was only robust against G.711; the proposed method survived all speech codecs (100% for G.711 and G.726, around 90% for G.729). This implies that the proposed method is robust against these speech codecs.

We then evaluated the proposed method against several kinds of meaningful processing. These included re-sampling at 24 kHz and 12 kHz, re-quantization with 24 bits and 8 bits, signal amplification by a factor of 2.0, and speech analysis/synthesis by short-time Fourier transform (STFT) and gammatone filterbank (GTFB). The BDR results after each kind of processing (averaged over the twelve stimuli) are plotted in Fig. 8. The proposed method was clearly more robust than LSB and CD.
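The bit-rate arithmetic, the majority decision, and the BDR used in these evaluations can be sketched as follows. This is a minimal illustration under stated assumptions rather than the evaluation code: the function names are hypothetical, “GOOD” is assumed to be encoded as 8-bit ASCII (the paper does not state the encoding), ties in the majority vote are resolved to ‘0’, and the frame errors are injected artificially.

```python
def text_to_bits(text):
    """Assumption: each character is sent as 8 ASCII bits, MSB first."""
    return [int(b) for ch in text.encode("ascii") for b in format(ch, "08b")]

def repeat_bits(bits, frames_per_bit=10):
    """Embed the same bit into every frame of a group."""
    return [b for b in bits for _ in range(frames_per_bit)]

def majority_decode(frame_bits, frames_per_bit=10):
    """Recover one bit per group of frames by majority decision."""
    decoded = []
    for start in range(0, len(frame_bits), frames_per_bit):
        group = frame_bits[start:start + frames_per_bit]
        decoded.append(1 if 2 * sum(group) > len(group) else 0)  # tie -> 0
    return decoded

def bit_detection_rate(embedded, detected):
    """Percentage of correctly detected bits (BDR)."""
    correct = sum(e == d for e, d in zip(embedded, detected))
    return 100.0 * correct / len(embedded)

FRAME_MS = 25
FRAMES_PER_BIT = 10
frames_per_second = 1000 // FRAME_MS            # 40 frames per second
bit_rate = frames_per_second / FRAMES_PER_BIT   # 4 bps

watermark = text_to_bits("GOOD")
frames = repeat_bits(watermark, FRAMES_PER_BIT)
for start in range(0, len(frames), FRAMES_PER_BIT):
    for offset in range(3):                     # corrupt 3 of every 10 frames
        frames[start + offset] ^= 1

detected = majority_decode(frames, FRAMES_PER_BIT)
print(bit_rate, bit_detection_rate(watermark, detected))
```

Under the ASCII assumption the watermark is 32 bits long, which takes 8 s at 4 bps and therefore fits within one 8.1 s stimulus.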

Figure 5: Concept of watermark detection: (a) ‘0’ is detected ($bw_{cd} > bw_{ab} \times \Omega_{e0}$) and (b) ‘1’ is detected ($|bw_{cd} - bw_{ab}| \approx 0$). $\theta_a$, $\theta_b$, $\theta_c$, and $\theta_d$ are the LSFs in the watermarked frame.

B. Rule of embedding ‘1’: To embed ‘1’, the second sharpest formant is enhanced $\Omega_{e1}$ ($\Omega_{e1} = BW_{cd}/BW_{ab}$) times. With this $\Omega_{e1}$, the original $BW_{cd}$ of the second sharpest formant is reduced to the same value as $BW_{ab}$ of the sharpest formant, that is, $BW_{cdw} = BW_{ab}$. To achieve this, $\phi_c$ and $\phi_d$ are shifted to $\phi_{cw}$ and $\phi_{dw}$ with (9).

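$$\phi_{cw} = \phi_c + \Delta_{e1} \quad \text{and} \quad \phi_{dw} = \phi_d - \Delta_{e1} \qquad (9)$$

where $\Delta_{e1}$ is calculated from $\phi_c$, $\phi_d$, and $\Omega_{e1}$ with (10).

$$\Delta_{e1} = \frac{1}{2} (\phi_d - \phi_c) \times \left(1 - \frac{1}{\Omega_{e1}}\right) \qquad (10)$$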
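As a quick check of (9) and (10), the sketch below shifts the LSF pair of the second sharpest formant and confirms that its bandwidth becomes equal to that of the sharpest formant. The function names and the LSF values are hypothetical and chosen only for illustration.

```python
def delta_e1(phi_c, phi_d, omega_e1):
    """Shift amount of (10)."""
    return 0.5 * (phi_d - phi_c) * (1.0 - 1.0 / omega_e1)

def embed_one(phi_a, phi_b, phi_c, phi_d):
    """Shift (phi_c, phi_d) as in (9) so that its bandwidth matches (phi_a, phi_b)."""
    bw_ab = phi_b - phi_a        # smallest bandwidth (sharpest formant)
    bw_cd = phi_d - phi_c        # second smallest bandwidth
    omega_e1 = bw_cd / bw_ab     # enhancement factor for embedding '1'
    delta = delta_e1(phi_c, phi_d, omega_e1)
    return phi_c + delta, phi_d - delta

# Illustrative LSFs in radians: BW_ab = 0.06 and BW_cd = 0.15.
phi_cw, phi_dw = embed_one(0.50, 0.56, 1.20, 1.35)
print(phi_cw, phi_dw, phi_dw - phi_cw)
```

Substituting (10) into (9) gives a new bandwidth of $(\phi_d - \phi_c)/\Omega_{e1}$, which is the condition $BW_{cdw} = BW_{ab}$ stated in the text.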

(iii) After the above process, a pair of shifted LSFs ($\phi_{aw}$ and $\phi_{bw}$ for embedding ‘0’, or $\phi_{cw}$ and $\phi_{dw}$ for embedding ‘1’) has been generated. These two LSFs and the other un-shifted LSFs are converted to LP coefficients, which are used with the LP residue to resynthesize the current frame. The watermarked signal is constructed from all watermarked frames with a non-overlapping-and-adding function.

Note that this watermarking method can be applied to both voiced and unvoiced speech segments, although the formants detected in unvoiced speech segments are pseudo-formants.

III) Watermark detection process

According to the embedding process, watermarks in the received watermarked signal can be detected as follows. For each watermarked frame, we extract the two smallest bandwidths, those of the sharpest and the second sharpest formants, and name them $bw_{ab}$ (the smallest) and $bw_{cd}$ (the second smallest). According to Fig. 5(a), if ‘0’ has been embedded, we have $bw_{cd} > bw_{ab} \times \Omega_{e0}$, for which an equivalent expression is given in (11); if ‘1’ has been embedded, according to Fig. 5(b), $bw_{cd}$ should be equal to $bw_{ab}$. Therefore, a threshold in (12) is set to discriminate between the two cases of embedding ‘0’ or ‘1’ and to make the method error-tolerant. One bit is extracted from each frame, and all extracted bits together form the detected watermarks, $\hat{s}(m)$.

$$bw_{cd} - bw_{ab} > bw_{ab} \times (\Omega_{e0} - 1) \qquad (11)$$
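The detection rule can be sketched as below. This is a simplified illustration, not the authors' detector: the function name is hypothetical, the bandwidths are assumed to be the gaps between adjacent LSFs in radians, and the threshold is read from (12) as half of the right-hand side of (11).

```python
def detect_bit(lsfs, omega_e0=2.0):
    """Detect one watermark bit from the ordered LSFs of a watermarked frame."""
    # Assumption: each candidate formant is a pair of adjacent LSFs.
    gaps = sorted(right - left for left, right in zip(lsfs, lsfs[1:]))
    bw_ab, bw_cd = gaps[0], gaps[1]          # smallest and second smallest
    threshold = bw_ab * (omega_e0 - 1) / 2   # halfway between the two cases
    return 0 if (bw_cd - bw_ab) > threshold else 1

# The second smallest gap is much wider than the smallest one: '0'.
bit_zero = detect_bit([0.30, 0.50, 0.53, 0.90, 1.00, 1.50])
# The two smallest gaps are nearly equal: '1'.
bit_one = detect_bit([0.30, 0.50, 0.53, 0.90, 0.932, 1.50])
print(bit_zero, bit_one)
```

Placing the threshold midway between the ideal ‘1’ case (difference of zero) and the ideal ‘0’ bound in (11) lets small perturbations of the LSFs move the bandwidths without flipping the decision.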


[Figure 6 panels: (a) LSD (dB), (b) PESQ (ODG), and (c) bit detection rate (%), each plotted against stimulus number (1 to 12) for the proposed, CD, and LSB methods.]

Figure 6: Evaluations of proposed method: (a) LSD, (b) PESQ, and (c) bit detection rate.

[Figure 7 panels: BDR (%) against stimulus number (1 to 12) for the proposed, CD, and LSB methods after G.711, G.726, and G.729.]

Figure 7: Evaluations of robustness against speech codecs.

[Figure 8: BDR (%) of the proposed, CD, and LSB methods after re-sampling (24 kHz and 12 kHz), re-quantization (24 bits and 8 bits), scaling, STFT, and GTFB.]

Figure 8: Evaluations of robustness against processing.

Finally, we evaluated the fragility of the proposed method. Since CD and LSB are not completely robust, even if they are fragile against tampering, they cannot tell whether a failed detection of watermarks is caused by meaningful processing or by tampering. That is, they cannot be used successfully for tampering detection unless they are improved. Therefore, the fragility evaluation was conducted only for the proposed method.

To reflect fragility intuitively, the 32×32 bitmap image in Fig. 9(a) was embedded as watermarks. Since the bit rate was 4 bps, a long speech signal (256 s) made by repeatedly concatenating the 12 stimuli was used as the original signal so that the complete image could be embedded. After embedding the image into the original signal, we separately tampered with the middle segment of the watermarked signal using the following realistic kinds of tampering [9]: adding white noise, reverberation (time: 0.3 s), concatenation with un-watermarked speech, low-pass filtering (order: 32, normalized cut-off frequency: 0.99), high-pass filtering (order: 32, normalized cut-off frequency: 0.01), speed up by +4%, speed down by −4%, and pitch shift by −4%. Speed up and speed down change the duration of speech without affecting its pitch; this is also referred to as tempo change. Pitch shift proportionally shifts the frequency components while preserving the duration of speech. It can be regarded as a gender change, which relates to tampering with the speaker.

Table 1: Bit detection rate in fragility evaluation.

Tampering type        BDR (%)    Tampering type        BDR (%)
Add white noise       45.19      Reverberation         68.80
Concatenation         42.86      Low-pass filtering    41.98
High-pass filtering   49.85      Speed up +4%          71.56
Speed down −4%        79.51      Pitch shift           68.12

Figure 9: Evaluations of fragility against tampering: (a) embedded image and detected images in different cases: (b) no tampering, (c) adding white noise, (d) reverberation, (e) concatenation, (f) low-pass filtering, (g) high-pass filtering, (h) speed up +4%, (i) speed down −4%, and (j) pitch shift −4%.

Figure 9 shows the images detected from each tampered signal. The BDR calculated from each image is listed in Table 1. The watermarks in the tampered region were destroyed, so the BDR dropped drastically. This suggests that the proposed method is fragile against tampering and that, with such a low BDR, tampering can be inferred.

The above evaluations indicate that the proposed method performs well in inaudibility, robustness, and fragility. Moreover, it can detect tampering through its fragility. In comparison, since LSB embeds watermarks in the least significant bits, the watermarks can easily be reset by other processing, which makes LSB not robust. CD embeds watermarks in the phase by modelling the cochlear delay; owing to the characteristics of cochlear delay, watermark detection depends strongly on the low-frequency phase. Once the low-frequency phase is destroyed by other processing, e.g., GTFB or the G.729 codec, the watermarks cannot be detected. Therefore, CD is not always robust.

4. Conclusion

This paper proposed a tampering detection scheme based on speech watermarking. Watermarks were embedded as formant enhancement by controlling a pair of LSFs. We evaluated the proposed scheme with respect to inaudibility, robustness, and fragility. The evaluation results revealed that the proposed scheme satisfies all these requirements, which means that it can provide effective detection of tampering. Because the proposed scheme is frame-based, a frame synchronization method will be implemented in the next stage.

5. Acknowledgements

This work was supported by a Grant-in-Aid for Scientific Research (B) (No. 23300070), an A3 Foresight Program made available by the Japan Society for the Promotion of Science, the Telecommunication Advancement Foundation, and funding from the China Scholarship Council.


6. References

[1] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, pp. 187–207, 1999.
[2] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum likelihood estimation of spectral parameter trajectory,” IEEE Trans. Audio, Speech and Language Proc., vol. 15, no. 8, pp. 2222–2235, 2007.
[3] H. Kawahara, H. Banno, T. Irino, and P. Zolfaghari, “ALGORITHM AMALGAM: Morphing waveform based methods, sinusoidal models and STRAIGHT,” Proc. ICASSP, pp. 13–16, 2004.
[4] T. Saitou, M. Goto, M. Unoki, and M. Akagi, “Vocal conversion from speaking voice to singing voice using STRAIGHT,” Proc. Synthesis of Singing Challenge, Special Session at Interspeech, 2007.
[5] M. Celik, G. Sharma, and A. M. Tekalp, “Pitch and duration modification for speech watermarking,” Proc. ICASSP, vol. II, pp. 17–20, 2005.
[6] C. Wu and C. Jay Kuo, “Fragile speech watermarking based on exponential scale quantization for tamper detection,” Proc. ICASSP, vol. IV, pp. 3305–3308, 2002.
[7] M. Unoki and D. Hamada, “Method of digital-audio watermarking based on cochlear delay characteristics,” J. Inn. Com. Inf., and Cont., vol. 6, no. 3(B), pp. 1325–1346, 2010.
[8] M. Unoki and R. Miyauchi, “Reversible watermarking for digital audio based on cochlear delay characteristics,” Proc. IIHMSP, pp. 314–317, 2011.
[9] M. Unoki and R. Miyauchi, “Detection of tampering in speech signal with inaudible watermarking technique,” Proc. IIHMSP, pp. 118–121, 2012.
[10] P. Bassia and I. P. Pitas, “Robust audio watermarking in the time domain,” Proc. EUSIPCO, pp. 25–28, 1998.
[11] L. Boney, H. H. Tewfik, and K. H. Hamdy, “Digital watermarks for audio signals,” Proc. ICMCS, pp. 473–480, 1996.
[12] Q. Cheng and J. Sorensen, “Spread spectrum signalling for speech watermarking,” Proc. ICASSP, vol. V, pp. 1337–1340, 2001.
[13] M. Narimannejad and A. S. Mohammad, “Watermarking of speech signal through phase quantization of sinusoidal model,” Proc. ICEE, pp. 1–4, 2011.
[14] M. D. Swanson, B. Zhu, A. H. Tewfik, and L. Boney, “Robust audio watermarking using perceptual masking,” Signal Processing, vol. 66, no. 3, pp. 337–355, 1998.
[15] S. Wang and M. Unoki, “Watermarking method for speech signals based on modifications to LSFs,” Proc. IIHMSP, pp. 283–286, 2013.
[16] S. Wang and M. Unoki, “Watermarking of speech signals based on formant enhancement,” Proc. EUSIPCO, (Accepted).
[17] T. Raitio, A. Suni, H. Pulakka, M. Vainio, and P. Alku, “Comparison of formant enhancement methods for HMM-based speech synthesis,” Proc. ISCA Speech Synthesis Workshop, 2010.
[18] HTS, “HMM-based speech synthesis system,” http://hts.sp.nitech.ac.jp, 2009.
[19] Recommendation ITU-T P.800, “Methods for subjective determination of transmission quality,” International Telecommunication Union, 1996.
[20] K. Takeda et al., “Speech database user’s manual,” ATR Technical Report TR-I-0028, 2010.
[21] A. Gray, Jr., and J. Markel, “Distance measures for speech processing,” IEEE Trans. Acoustics, Speech and Signal Proc., vol. 24, no. 5, pp. 380–391, 1976.
[22] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Trans. Audio, Speech, and Language Proc., vol. 16, no. 1, pp. 229–238, 2008.
