SPATIALLY-ADAPTIVE WAVELET IMAGE COMPRESSION VIA STRUCTURAL MASKING

Matthew Gaubatz, Stephanie Kwan, Bobbie Chern, Damon Chandler and Sheila S. Hemami
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853

ABSTRACT

Wavelet-based spatial quantization is an image compression technique that adapts quantization to the data in each region of an image. This approach is motivated by the fact that quantization with a single step-size does not produce a uniform visual effect across spatial locations; different types of image content mask quantization errors in different ways. While many spatial quantization techniques determine step-sizes through local activity measures, the proposed method induces local quantization distortion based on experiments that quantify human detection of this distortion as a function of both the contrast and the type of the image data. Three types in particular are considered: textures, structures, and edges. A classifier detects to which of these three categories a region of image data belongs, and step-sizes are then derived based on the contrast and classification of local regions. The classification and contrast data are conveyed to the coder as explicit side information. For images compressed at threshold, the proposed method requires 3-10% less rate than a similar previous approach without classification, and the resulting images are on average preferred by two-thirds of tested viewers.

1. INTRODUCTION

It is well known that image content masks distortions induced by quantization of wavelet coefficients [1–4]. A natural consequence of this observation is that wavelet compression algorithms can be designed to quantize data more coarsely in regions that will hide the resulting distortions. Indeed, several wavelet-based approaches have been presented that adapt quantization step-sizes based on the local activity of image data in each region of the image [5, 6]. These schemes predict step-sizes from the activity of previously coded data. A more recent approach is based on local texture masking [7], where side information is required to explicitly represent spatially-adaptive, experimentally-derived step-sizes. In [7], quantization step-sizes are derived based on the masking ability, i.e., the contrast, of local image data, together with the assumption that image data is well modeled as homogeneous regions of texture. A fairly high-resolution map of local image contrasts is required, however, to prevent data in non-texture regions from being quantized too coarsely. In some cases, the rate to represent this map accounts for over 5% of
the total coded image rate. Furthermore, it has been shown that masking thresholds differ based on the structural class of a region of data [8], that is, based on whether the region contains textures, structures, or edges. This kind of masking is referred to as structural masking. Region classification has been shown to improve coding gain in an MSE-based rate-distortion image coding framework [9]. In the same way, the coder proposed herein uses a classification scheme to more efficiently leverage the structural masking properties of regions containing texture, structure, or edge data. Quantization step-sizes are derived from masking results that are specific to each structural class. The key contribution of this work is the application of structural masking to compression: explicit classification information reduces the side information required to convey spatially-adaptive step-sizes, which further reduces the overall coded rate while maintaining image quality.

This paper is organized as follows. A structural masking experiment and the associated results are summarized in Section 2. A spatially-adaptive wavelet-based coder that leverages these results is discussed in Section 3, including a description of an example scheme for determining the structural classes of image regions. Finally, Section 4 illustrates some results and concludes the paper.

2. CONTRAST- AND STRUCTURAL-CLASS-ADAPTIVE MASKING

The coder proposed in this paper uses spatially-adaptive quantization step-sizes based on (1) the contrast and (2) the structural class of each image region, since these factors affect how the region masks quantization distortion. The root-mean-squared (RMS) contrast of a region, $C_{\mathrm{rms}}$, is defined as

$$C_{\mathrm{rms}} = \frac{\text{standard deviation of region}}{\text{mean value of region}}, \tag{1}$$
where these statistics are computed in the luminance (rather than pixel) domain. The contrast threshold for distortions viewed by an observer against a region of image data, i.e., a masker, is the maximum RMS contrast of distortions that are barely visible to the observer. The following experiment was performed to determine how contrast thresholds vary as a function of the structural class and RMS contrast of the masker.
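As a concrete illustration, the following is a minimal sketch of how (1) might be computed for an image block. The pixel-to-luminance mapping here is an assumption (a simple power-law display model with a hypothetical gamma of 2.2); the exact mapping used in the experiments is not specified in this sketch.

```python
import numpy as np

def rms_contrast(block, gamma=2.2):
    """RMS contrast of an image block, per Eq. (1).

    8-bit pixel values are mapped to relative luminance with a simple
    power-law display model; the gamma value is an assumption, not a
    parameter reported in the paper.
    """
    luminance = (block.astype(np.float64) / 255.0) ** gamma
    mean = luminance.mean()
    if mean == 0.0:
        return 0.0  # avoid dividing by zero for an all-black block
    return luminance.std() / mean
```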
[Figure 1 plot; legend: textures, structures, edges; axes: log10 RMS contrast of masker (horizontal) vs. log10 mean RMS contrast of distortion (vertical).]

Fig. 1. Average threshold-versus-contrast (TvC) curves for the masking experiment, plotted against logarithmic axes. Notice that the threshold contrast increases as a function of masker contrast for each type of masker. At a fixed contrast, edge data masks considerably less distortion than structure data, which in turn masks slightly less distortion than texture data. These curves suggest how to quantitatively distribute compression errors across structural classes.
A total of fourteen 64 × 64-pixel maskers were used (four from the texture class, five from the structure class, and five from the edge class), cropped from 8-bit 512 × 512 greyscale images. Several versions of each masker were generated by adjusting its RMS contrast to a set of different values. Targets for this experiment were generated by quantizing all HL3 subband coefficients of each version of each masker with a single step-size; this procedure was repeated with several step-sizes to generate distortions with different RMS contrast values. Four subjects participated in the experiment. After being allowed time to adapt to the lighting conditions of the testing environment, subjects were shown a pair of maskers at a fixed viewing distance, one with and one without quantization distortions, and asked which masker contained the distortions. If no response was given within a time limit, it was assumed that the viewer chose the incorrect masker. The experiment was double-blind, and detection thresholds were determined with a QUEST staircase. Further details regarding this experiment may be found in [8].

The resulting threshold-versus-contrast (TvC) curves, averaged over observers, are illustrated in Figure 1. Detection thresholds differ for each type of content. For instance, the thresholds for edge data are considerably lower, which makes sense since aliasing is often easier to see along crisp boundaries than within textures resembling noisy data. Previous work demonstrated how these thresholds vary as a function of spatial frequency [4]; similar methods are used to extend the scale-3 TvC curves to other scales. For each region of each subband, step-sizes for wavelet coefficients are selected as those that induce the threshold contrasts. For a decoder to determine step-sizes, the contrast and class of each region must be coded in the bit-stream. These quantities are denoted the contrast map and classification map, respectively.
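To make the step-size selection concrete, the sketch below shows one way a coder could map a region's masker contrast and structural class to a quantization step-size. The log-log TvC parameterization and its coefficients are illustrative placeholders (the measured curves appear in Figure 1 and [8]), and `distortion_contrast` stands in for a routine that measures the RMS contrast of the distortion induced by quantizing the region with a given step-size.

```python
import numpy as np

# Hypothetical straight-line fits to the log-log TvC curves of Figure 1:
# log10(threshold) = slope * log10(masker contrast) + intercept.
# These coefficients are placeholders, not the measured values.
TVC_PARAMS = {
    "texture":   (0.7, -0.60),
    "structure": (0.7, -0.75),
    "edge":      (0.7, -1.20),
}

def threshold_contrast(masker_contrast, structural_class):
    """Threshold distortion contrast for a region, from its TvC curve."""
    slope, intercept = TVC_PARAMS[structural_class]
    log_c = np.log10(max(masker_contrast, 1e-4))
    return 10.0 ** (slope * log_c + intercept)

def select_step_size(region_coeffs, masker_contrast, structural_class,
                     distortion_contrast, candidate_steps):
    """Largest step-size whose induced distortion stays at threshold.

    `distortion_contrast(coeffs, step)` is assumed to return the RMS
    contrast of the distortion caused by quantizing `coeffs` with `step`;
    this quantity increases monotonically with the step-size.
    """
    target = threshold_contrast(masker_contrast, structural_class)
    steps = sorted(candidate_steps)
    best = steps[0]  # fall back to the finest step if none is acceptable
    for step in steps:
        if distortion_contrast(region_coeffs, step) <= target:
            best = step   # still at or below threshold: accept this step
        else:
            break         # distortion now exceeds threshold: stop
    return best
```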
Fig. 2. An example image, bug (bottom), with its contrast map (left) and classification map (right). In the contrast map, dark regions denote low values and light regions denote higher values. In the classification map, black regions denote texture, grey regions denote structure, and white regions denote edges. This information must be transmitted to the decoder so that quantization step-sizes may be determined.

3. CODING STRATEGY

The proposed coding strategy consists of two main stages: (1) segmentation and classification of image regions, and (2) coding of side information and quantized wavelet coefficients. These stages are explained in the following subsections and are illustrated in Figure 4.

Segmentation and Classification Scheme

To generate the contrast and classification maps, the image must first be segmented into regions. One simple yet effective approach is to partition the image into N × N-pixel blocks; in practice, good performance is observed with N = 16. The contrast map is computed by mapping pixels into the luminance domain and applying (1). Each region (block) must also be assigned a structural class to create the classification map. In the proposed implementation, a multi-stage approach is adopted; Figure 2 contains an example classification map produced by this approach. Each stage performs one type of classification, and only unclassified blocks are considered in subsequent stages. The classifier stages are described in more detail below (a sketch follows the list). Let $n$ index the blocks in the image and let $\sigma_n^2$ denote the variance of block $n$.

1. The first stage classifies a block as texture if the average squared value of its entries (the energy) exceeds a threshold of the form $C_1 \cdot |\{n : \sigma_n^2 > T_N\}|$, where $C_1$ and $T_N$ are constants. (In practice, for N = 16, $T_N = 70$ yields good results.) Thresholds for subsequent stages are also of this form.
2. The second stage passes the image data in all unclassified blocks through a Canny edge detector and classifies a block as an edge if its average Canny detector output exceeds a similar threshold.

3. The third stage uses another well-known edge-detection scheme, the Laplacian-of-Gaussian operator; in the same way, blocks whose average detector output exceeds a threshold are classified as edges.

4. Any non-texture blocks with average energy above a threshold are classified as containing edges.

5. All remaining unclassified blocks are labeled as structures.
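The following is a minimal sketch of this five-stage classifier, under stated assumptions: the constants C1 through C4 and the detector settings (Canny defaults, a Laplacian-of-Gaussian sigma of 2.0) are illustrative placeholders, since only T_N = 70 for N = 16 is reported above.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace
from skimage.feature import canny

TEXTURE, STRUCTURE, EDGE = 0, 1, 2

def classify_blocks(image, N=16, T_N=70.0, C1=1.0, C2=1.0, C3=1.0, C4=1.0):
    """Five-stage block classifier; constants other than T_N are placeholders."""
    img = image.astype(np.float64)
    rows, cols = img.shape[0] // N, img.shape[1] // N
    labels = np.full((rows, cols), -1, dtype=int)  # -1 means unclassified

    def per_block(arr, reduce_fn):
        return np.array([[reduce_fn(arr[r*N:(r+1)*N, c*N:(c+1)*N])
                          for c in range(cols)] for r in range(rows)])

    variances = per_block(img, np.var)
    energies = per_block(img, lambda b: (b ** 2).mean())

    # Every stage threshold has the form C * |{n : var_n > T_N}|.
    scale = float(np.count_nonzero(variances > T_N))

    # Stage 1: high-energy blocks are textures.
    labels[energies > C1 * scale] = TEXTURE

    # Stages 2 and 3: average per-block output of two edge detectors.
    canny_resp = per_block(canny(img / 255.0).astype(np.float64), np.mean)
    log_resp = per_block(np.abs(gaussian_laplace(img, sigma=2.0)), np.mean)
    labels[(labels == -1) & (canny_resp > C2 * scale)] = EDGE
    labels[(labels == -1) & (log_resp > C3 * scale)] = EDGE

    # Stage 4: remaining non-texture blocks with high energy become edges.
    labels[(labels == -1) & (energies > C4 * scale)] = EDGE

    # Stage 5: everything left over is structure.
    labels[labels == -1] = STRUCTURE
    return labels
```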
Side Information and Wavelet Coefficient Coding

Prior to coding, the image is segmented into regions, after which the contrast and classification maps are computed. An example of how this task can be performed is given in the previous subsection, though any segmentation/classification scheme is valid. Image file header information (width, height, etc.) is coded first, followed by the classification and contrast maps. The classification map, which associates one of three classes with each region, is losslessly coded with a simple arithmetic coder conditioned on the previously coded classes of adjacent regions. The contrast map, on the other hand, is coded lossily because it consists of a continuum of values (see Figure 2); in particular, it is compressed using a standard wavelet transform, quantizer, and entropy coder. This classification/contrast map information is sufficient to generate spatially-adaptive step-sizes at both the encoder and decoder.

A variety of methods are available to code the actual image data, which is transformed to the wavelet domain with a 5-level 9/7 DWT. As in [7], the image may be compressed in a non-embedded fashion, where the wavelet coefficient quantization indices are separated into a significance map, refinement bits, and sign bits. The locations of non-zero entries in each subband significance map are coded first (which is why the bit-stream is not refinable below the subband level), and the values of those entries are then coded conditioned on the step-sizes used to create them. Refinement bits are represented with a simple adaptive arithmetic coder, and sign bits are inserted into the stream uncoded, as needed. Another alternative is to generate the step-sizes, quantize the wavelet data, and then simply compress the resulting quantization indices using a more traditional embedded wavelet coder. Both types of coding require the spatially-adaptive step-sizes to be derived from the (de)coded contrast map, to ensure invertibility at the decoder. These two methods are compared in Table 1, which illustrates the tradeoff between achievable efficiency and robustness.

4. RESULTS

Images are compressed at visually lossless rates with the proposed method and the coder in [7].

image     embedded (bpp)   non-embedded (bpp)   change (bpp)
beans     1.387            1.333                -0.054
bug       1.032            0.993                -0.039
gazebo    1.281            1.246                -0.035
house     1.233            1.195                -0.038
rhino     1.854            1.784                -0.070
temple    0.840            0.813                -0.027

Table 1. Comparison of the total coded rate achieved with embedded and non-embedded implementations of the proposed coder. The rate gain due to the non-embedded approach ranges between approximately 0.03 and 0.07 bpp.
Fig. 3. Residual images resulting from compressing bug with the texture-masking-based approach of [7] (left) and the proposed structural-masking-based approach (right); the residuals have been contrast-enhanced to emphasize differences.
The main differences between these coders are that (1) the proposed coder requires additional rate to specify the results of the classification stage, so that the decoder can invertibly derive the quantization step-sizes from the contrast map; (2) the proposed coder derives step-sizes from TvC curves tailored to the structural class of each region (texture, structure, or edge) rather than from TvC curves for textures alone; and (3) the proposed coder segments images into 16 × 16 blocks instead of 8 × 8 blocks, since content classification, rather than a highly granular segmentation, can be used to ensure appropriate selection of step-sizes.

The test set consists of six 512 × 512 8-bit greyscale images, collected from standard databases as well as the internet and illustrating a range of structural classes. For the tested images, the proposed coder reduces the overall coded rate by 3-10%. Rates for the individual test images are listed in Table 2, along with the reductions achieved in bits per pixel (bpp). The reductions occur for two main reasons: first, the side rate associated with a 16 × 16-pixel block segmentation is smaller than that associated with an 8 × 8-pixel block segmentation; second, the class-specific segmentation results in slightly higher contrast values, and thus slightly more aggressive step-sizes, for certain regions. At the same time, however, the proposed coder spends more bits representing certain regions classified as structures or edges (reflected in Figure 3).
[Figure 4 block diagram. Segmentation and classification of the input image produces a contrast map and a classification map. Side information compression: the classification map is entropy coded; the contrast map is compressed with a DWT, dead-zone quantizer, and entropy coder, then decoded (entropy decode, dead-zone dequantize, inverse DWT) so that step-sizes can be computed identically at the encoder and decoder. Wavelet coefficient compression: the image data is transformed with a DWT, dead-zone quantized with the computed step-sizes, and the quantized wavelet coefficients are coded, possibly conditioned on the step-sizes. Outputs: coded contrast map, coded classification map, and coded image data.]
Fig. 4. Block diagram of the proposed coder. The major components perform segmentation/classification, side information coding, and wavelet coefficient coding. When quantized wavelet coefficients are coded conditioned on step-size values, the greatest compression gain is achieved with a non-embedded coding strategy (see Table 1).
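As an illustration of the quantization stages in Figure 4, the sketch below implements a generic dead-zone quantizer driven by a per-block step-size map. The reconstruction offset of 0.5 and the fixed block size per subband are assumptions for the sketch; in practice, the region covered by one map entry shrinks with the subband scale.

```python
import numpy as np

def deadzone_quantize(coeffs, step):
    """Dead-zone quantization indices: q = sign(c) * floor(|c| / step)."""
    return np.sign(coeffs) * np.floor(np.abs(coeffs) / step)

def deadzone_dequantize(indices, step, offset=0.5):
    """Reconstruct within each bin; the 0.5 offset is an assumption."""
    return np.sign(indices) * (np.abs(indices) + offset) * step

def quantize_subband(subband, step_map, block=16):
    """Quantize each block of a subband with its own step-size.

    `step_map[r, c]` is the step for block (r, c). Both encoder and
    decoder must derive this map from the *decoded* contrast and
    classification maps, so that quantization is invertible.
    """
    indices = np.empty_like(subband, dtype=np.float64)
    for r in range(step_map.shape[0]):
        for c in range(step_map.shape[1]):
            sl = np.s_[r*block:(r+1)*block, c*block:(c+1)*block]
            indices[sl] = deadzone_quantize(subband[sl], step_map[r, c])
    return indices
```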
("tex." = texture-based coder of [7]; "prop." = proposed texture/structure/edge-based coder)

         overall rate (bpp)        side rate (bpp)           PSNR (dB)                preference (# viewers)
image    tex.   prop.   change     tex.   prop.   change     tex.    prop.   change   tex.   prop.
beans    1.372  1.333   -0.039     0.054  0.022   -0.032     30.13   30.21   +0.08    3      7
bug      1.092  0.993   -0.099     0.036  0.017   -0.019     34.50   34.66   +0.16    2      8
gazebo   1.357  1.246   -0.111     0.042  0.019   -0.023     31.80   31.53   -0.27    3      7
house    1.333  1.195   -0.138     0.030  0.015   -0.015     33.81   33.48   -0.33    4      6
rhino    1.861  1.784   -0.077     0.060  0.022   -0.038     28.42   28.09   -0.33    4      6
temple   0.879  0.813   -0.066     0.031  0.016   -0.015     35.71   35.93   +0.22    4      6
Table 2. Comparison of the total coded rate, side rate, and PSNR achieved with the texture-based coder in [7] and the proposed classification-based coder, along with viewer preferences.

The PSNR values in the table corroborate this notion: sometimes the PSNR increases due to finer quantization, and sometimes it decreases due to coarser quantization. Results of a perceptual test comparing the two coders are also shown in Table 2. Ten non-expert viewers were presented with both images and asked which was of better quality. On average, two-thirds of the viewers preferred the images created with the proposed method. (Images from the test may be found at http://foulard.ece.cornell.edu/gaubatz/icip06/icip06.html.) Figure 3 illustrates a comparison between residuals, which shows that the proposed coder places smaller errors in the regions corresponding to the structures in bug and larger errors in some edge regions that can mask more distortion. The fact that image quality is maintained, even when PSNR decreases, suggests that the proposed coder makes more efficient use of structural masking properties.

5. REFERENCES

[1] R. J. Safranek and J. D. Johnston, "A perceptually tuned subband image coder with image dependent quantization and post-quantization data compression," in Proc. ICASSP, May 1989, vol. 3, pp. 1945–1948.
[2] A. B. Watson, G. Y. Yang, J. A. Solomon, and J. Villasenor, "Visibility of wavelet quantization noise," IEEE Trans. Image Processing, vol. 6, pp. 1164–1175, 1997.
[3] D. Chandler and S. S. Hemami, "Effects of natural images on the detectability of simple and compound wavelet subband distortions," J. Opt. Soc. Am. A, vol. 20, pp. 1164–1183, 2003.
[4] M. Gaubatz, D. Chandler, and S. S. Hemami, "Spatial quantization via local texture masking," in Proc. SPIE HVEI, Jan. 2005, vol. 5666, pp. 95–106.
[5] W. Zheng, S. Daly, and S. Lei, "Point-wise extended visual masking for JPEG-2000 image compression," in Proc. ICIP, Sep. 2000, vol. 1, pp. 657–660.
[6] I. Höntsch and L. Karam, "Locally adaptive perceptual image coding," IEEE Trans. Image Processing, vol. 9, pp. 1472–1483, 2000.
[7] M. Gaubatz, D. Chandler, and S. S. Hemami, "Spatially-selective quantization and coding for wavelet-based image compression," in Proc. ICASSP, Mar. 2005, vol. 2, pp. 209–212.
[8] S. S. Hemami, D. Chandler, B. Chern, and J. Moses, "Suprathreshold visual psychophysics and structure-based visual masking," in Proc. SPIE VCIP, Jan. 2006, vol. 6077, pp. 239–253.
[9] R. Joshi, H. Jafarkhani, J. Kasner, T. Fischer, N. Farvardin, M. Marcellin, and R. Bamberger, "Comparison of different methods of classification in subband coding of images," IEEE Trans. Image Processing, vol. 6, pp. 1473–1486, 1997.