Segmentation of Natural Images for CBIR Paul Stefan Williams and Mike D. Alder Centre for Intelligent Information Processing Systems Department of Electrical & Electronic Engineering The University of Western Australia Nedlands W.A. 6907, AUSTRALIA phone: +61-8-9380-3856 fax: +61-8-9380-1104
[email protected] [email protected] ABSTRACT This paper examines the problem of segmenting colour images into homogeneous regions for use in content based image retrieval (CBIR) or object recognition in general. Low level features provide intensity, colour and texture characteristics across the entire image. From these feature vectors a measure of local homogeneity is obtained. Through iterative modelling a seed and grow style algorithm is used to locate each segment. The nal segment models provide sucient information for higher level processing or classi cation. Segmentation and classi cation results are illustrated from a database of 1000 Corel Photo CD image.
keywords: Pattern Recognition and Analysis, Image Segmentation, Content Based Image Retrieval (CBIR), Syntactic Pattern Recognition, Image Recognition
1. Introduction

The ability to perform image database queries becomes increasingly important as collections of images grow larger and more distributed. In order to be useful, such image retrieval systems must be able to perform automatic searches based on image content. Current systems tend either to be very specific, based on object models, or very general, based on global characteristics. The ability to locate horses [1] or naked people [2] using body plans or other frameworks is useful, although it requires specific object models. In contrast, commercial systems such as QBIC [3][4] perform searches based on global colour or textural information. A search for images matching an elephant on the current QBIC system¹ produced images including a bicycle, a house in sunlight, a river and a monument at night. This is clearly far from ideal. One of the more sophisticated CBIR systems [5][6] locates connected segments of the same colour and texture, characterising an image by the size, colour, texture and relative location within the image (top left,
top center, top right, et cetera). The system first transforms the image into 13 binary images, or colour channels. These are based on a partitioning of the HSI colour cone and are hence disjoint. A further 6 binary images are obtained to characterise texture using the windowed image second moment matrix [6][7]. From these a set of connected blobs are further discretised according to size and relative position to form a single feature vector. Classification is performed using the Mahalanobis distance in this new feature space. This system has some obvious benefits. It is general, making few prior assumptions about the content of the image, and makes use of local information. However, the method is susceptible to noise and changes in an image's white balance due to the first discretisation process. Furthermore, it is highly dependent on orientation and translation, which cause blobs to move from one relative location to another. The ideal CBIR system should be general, making few prior assumptions about the content of the image. A balance between local characteristics and global structure inferred from contextual information is desirable. Tolerance to noise, changes in contrast and brightness, and sufficient translational and rotational invariance are also required.

¹ Refer to "http://www.qbic.almaden.ibm.com/". It should be noted that this online system may be modified, updated and possibly improved without notice.
2. Approach

In this paper a hierarchical approach is adopted which involves combining low level feature vectors into spatially disjoint regions, and these in turn into higher level objects. For example, the simple image illustrated in Figure 1(a) contains four initial segments. A higher level of processing should combine the two sky regions since they mutually predict one another and are positioned close together relative to their size. The segmentation process is intended to be general, assuming only that images are real world colour scenes of a known resolution. Most importantly, the process is designed to be automatic and unsupervised, inferring structure from the information in the current image. In summary, the method should have characteristics of automation, generality and hierarchical structure [8].
[Figure 1: panel (a) shows dark blue sky, blue cloud and light blue water regions split by occluding foreground; panel (b) labels the two sky regions as a single Object 1, the cloud as Object 2 and the water as Object 3.]
(a) Image content.
(b) Ideal object segmentation.
Fig. 1: This example illustrates a common pattern recognition problem known as occlusion.
A generic feature extraction process, which is easily adaptable to image resolution, allows the local intensity, textural and colour characteristics to be sufficiently described. Previous work [8] shows this representation to be suitable for locating anomalous and hence homogeneous regions. Furthermore, it provides textural characteristics sufficient to segment newspaper images into regions corresponding to background, edges, pictures and text [9].
Using these local feature vectors and this measure of homogeneity, spatially disjoint regions of the image are located. An initial homogeneous seed region is selected, with the corresponding feature vectors modelled by low order geometric moments. A flood fill style algorithm is used to expand the region and incorporate neighbouring feature vectors agreeing with the model. The model is refined on each iteration. This process is repeated with an increased tolerance to homogeneity. Once the segmentation process is complete each region is modelled using up to second order geometric moments. The model includes intensity, textural and colour characteristics from the feature vectors as well as spatial information. These regions are suitable for higher level processing or final representation. At this stage appropriate metrics are required for image matching, or training techniques for image classification. The remainder of this paper outlines the feature extraction process and segmentation algorithm in more detail. Segmentation results are discussed for select images from the 1000 segmentation images shown in [10]. Higher level processing and image classification are also outlined.
3. Feature Extraction

The feature extraction framework used is described fully in [8][9][11][12]. In this method a mask, as depicted in Figure 2, is used. The mask consists of a square set of blocks of size block size, or $k$. Each of these is an $n \times n$ window, where the window size ($n$) governs the resolution of the feature extraction process. The mask is placed over the top left corner of the image and a window measurement is made using the $n^2$ pixels in each of the $k^2$ windows. In addition, the spatial offset $(x, y)$ of the mask is added, forming a feature vector in $R^{k^2+2}$ for a single one dimensional window measurement. This process is repeated for all possible mask placements across the image.²
Fig. 2: Mask used to obtain local image characteristics (window size n = 7, block size k = 3, mask size n × k = 21).
Four window measurements are used in the feature extraction. The first two are the average pixel intensity ($i$) and the one dimensional standard deviation of intensity ($\sigma$) within the window. The other two represent the average colour ($c_x$, $c_y$) within the window.
² For the values used, $\Delta x = \Delta y = 1$ pixel. To increase speed and reduce memory load larger values can be used.
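To make the feature extraction concrete, the following is a minimal Python/numpy sketch (not the authors' implementation). It assumes the HSI-derived planes intensity, col_x and col_y are already available as 2-D arrays of the same size, and emits one feature vector per mask placement:

```python
import numpy as np

def extract_features(intensity, col_x, col_y, n=6, k=3, step=1):
    """Slide a k x k mask of n x n windows over the image, emitting one
    feature vector in R^(4*k^2 + 2) per mask placement (Section 3)."""
    h, w = intensity.shape
    mask = n * k                          # mask covers mask x mask pixels
    features = []
    for y in range(0, h - mask + 1, step):
        for x in range(0, w - mask + 1, step):
            v = []
            for by in range(k):           # one (i, sigma, cx, cy) per window
                for bx in range(k):
                    ys, xs = y + by * n, x + bx * n
                    win = intensity[ys:ys + n, xs:xs + n]
                    v += [win.mean(), win.std(),
                          col_x[ys:ys + n, xs:xs + n].mean(),
                          col_y[ys:ys + n, xs:xs + n].mean()]
            features.append(v + [x, y])   # append the spatial offset
    return np.asarray(features)
```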
In summary, the feature extraction process maps an image in $R^5$ (red, green, blue, x position, y position) into a series of points in $R^{4k^2+2}$:
$$I[r, g, b, x, y] \mapsto P[i_1, \sigma_1, c_{x1}, c_{y1}, i_2, \ldots, i_{k^2}, \sigma_{k^2}, c_{xk^2}, c_{yk^2}, x, y]$$

3.1. Colour Measurements

The HSI colour scheme is used in preference to RGB, CIEXYZ [13][14] and other standard colour representations. HSI is more suitable than RGB since it separates the intensity and colour components. Furthermore, the hue-saturation plane is linear and hence supports averaging, the Euclidean metric and basic statistical methods.³ Figure 3(a) shows the distribution of colours in the hue-saturation plane as taken from over 1000 outdoor scenes. As illustrated here, purely red and purple colours are uncommon in nature, as are lurid greens. Since the angular hue component of HSI is chosen arbitrarily with red at zero degrees, it is rotated according to this natural correlation. Furthermore, the space is stretched to provide more distinction between green and magenta. The final colour measurements represent a linear transformation of the hue-saturation plane as shown in Figure 3(b).
(a) Hue-saturation plane.
(b) Adopted colour measurements.
Fig. 3: The distribution of the hue-saturation plane for natural images, with red at 0, green at 120 and blue at 240 degrees.
The HSI colour scheme can be represented using either a colour cylinder or a colour cone [14]. In the cylindrical colour space saturation values are independent of intensity, thus allowing extremely dark colours to be assigned unreasonably high saturation values. On the other hand, the more popular conical colour space limits the saturation linearly with intensity. A trade-off is used in which the saturation is limited by $1 - (1 - \text{intensity})^N$ for $N = 2$.⁴ This can geometrically be viewed as a parabola rotated around the y-axis.

³ The CIEXYZ colour space is highly nonlinear, requiring special formulae to obtain the perceptual distance between two colours. It also requires a white reference.
⁴ Using $N = 1$ is equivalent to the HSI cone, whereas $N = \infty$ corresponds to the HSI cylinder.
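A minimal sketch of these colour measurements, assuming hue is given in radians and saturation and intensity lie in [0, 1]; the rotation of 0.545 radians is taken from Section 4.3, while the green-magenta stretch factor is a hypothetical placeholder:

```python
import numpy as np

def colour_measurements(hue, sat, intensity, rot=0.545, N=2, stretch=2.0):
    """Map HSI values to the (cx, cy) colour measurements: saturation is
    capped by the parabolic limit 1 - (1 - I)^N, and the hue-saturation
    plane is rotated (and stretched along one axis) as in Section 3.1."""
    sat = np.minimum(sat, 1.0 - (1.0 - intensity) ** N)
    cx = sat * np.cos(hue - rot)
    cy = sat * np.sin(hue - rot) * stretch  # stretch factor is illustrative
    return cx, cy
```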
4. Image Segmentation

The following image segmentation method is based on that described in detail in [11].
4.1. Determining Local Change

First consider the feature extraction process outlined in Section 3 using a single window measurement. In this case, a small region with homogeneous characteristics will produce a feature vector $p \in P$ which lies close to the vector $\vec{1} = [1, 1, \ldots, 1]$, regardless of dimension. Hence, the orthogonal distance $d$ from $p$ to $\vec{1}$ provides a sufficient measure of local change, i.e. $d = \|p - l\vec{1}\|$ where $l = \|\mathrm{proj}_{\vec{1}}\,p\|$ ($l$ is the length of the projection of $p$ onto $\vec{1}$) [9]. This simple measurement produces comparable results to modelling the entire feature space with low order geometric moments [8] using only a fraction of the time. Extending this measure to feature vectors of multiple window measurements requires that individual window measurements have comparable magnitude and deviation. To cater for this the average intensity, local standard deviation and average colour measurements are scaled accordingly.
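This measure reduces to a few vector operations; a minimal numpy sketch (using the unit diagonal vector, which is equivalent to the formulation above):

```python
import numpy as np

def local_change(p):
    """Orthogonal distance d from feature vector p to the diagonal
    [1, 1, ..., 1]: d = ||p - l*u|| where u is the unit diagonal vector
    and l = ||proj(p)|| is the length of the projection onto it."""
    p = np.asarray(p, dtype=float)
    u = np.ones(p.size) / np.sqrt(p.size)  # unit vector along the diagonal
    l = p @ u
    return np.linalg.norm(p - l * u)
```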
4.2. Segmentation Algorithm

Using the feature extraction process outlined in Section 3 and the measure of local change outlined above, the segmentation algorithm consists of the following steps.
1. Label each feature vector as not belonging to any class or object identifier. This label is referred to as unused.
2. Tentatively identify homogeneous regions allowing a low tolerance to local change. An initial local change threshold is used here.
3. Model spatially disjoint objects which are large enough. Each region is modelled according to shape, position and feature vector characteristics using geometric moments.
4. For each object, in decreasing size order, perform an iterative flood fill algorithm. This involves examining each feature vector spatially adjacent to the modelled region with respect to the object's model. Vectors agreeing with the object model are added and the model is appropriately refined. The process is repeated until no more surrounding vectors can be added. Label the final model with a new object identifier, excluding the region from further classification. Models which are not large enough are labelled unused.
5. Repeat the entire process from the second step with more tolerance to local change.
The idea behind this method is to locate large homogeneous and spatially disjoint objects, model these and continue to look for smaller ones. The initial tolerance to local change is made extremely low to ensure smooth object boundaries are not crossed. Once an object has been modelled, the tolerance can be increased since the object's model prevents boundary crossings. Ideally objects should be separated by thin unused regions corresponding to the edges or areas between spatially disjoint objects.
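The core grow step can be sketched as follows (Python/numpy, not the authors' code). Feature vectors are arranged on an (H, W) grid, the region model is a running mean and covariance maintained via running sums, and a fixed Mahalanobis threshold stands in for the dynamic one described in Section 4.3:

```python
import numpy as np
from collections import deque

def grow_region(feat, unused, seed, threshold=5.5):
    """Seed-and-grow sketch: starting from `seed` (a list of (y, x) grid
    positions), add 4-connected unused neighbours whose feature vectors
    agree with the region model, refining the model after each addition."""
    H, W, D = feat.shape
    pts = [feat[y, x] for y, x in seed]
    s1 = np.sum(pts, axis=0)                   # running sum
    s2 = sum(np.outer(p, p) for p in pts)      # running sum of squares
    n = len(pts)
    region, queue = set(seed), deque(seed)
    while queue:
        y, x = queue.popleft()
        for q in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if not (0 <= q[0] < H and 0 <= q[1] < W):
                continue
            if q in region or not unused[q]:
                continue
            mean = s1 / n                      # current model, refined so far
            cov = s2 / n - np.outer(mean, mean) + 1e-6 * np.eye(D)
            d = feat[q] - mean
            if np.sqrt(d @ np.linalg.solve(cov, d)) < threshold:
                region.add(q)
                queue.append(q)
                s1 += feat[q]; s2 += np.outer(feat[q], feat[q]); n += 1
    return region
```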
4.3. Implementation

Feature extraction is performed using a window size $n = 6$, a block size $k = 3$ and mask shifts $\Delta x = \Delta y = 1$. Window measurements include the average intensity, standard deviation of intensity and two average colour measurements, forming a set of points in $R^{4k^2+2}$ which is alternatively viewed as a set of points in $R^{4k^2}$ indexed by position $(x, y)$. The colour measurements make use of the HSI parabolic space with an additional rotation of 0.545 radians⁵ as outlined in Section 3.1. The four window measurements are scaled by 1.0, 4.0, 415 and 1225 respectively. An initial local change threshold of 5 is used, which is increased by 2 until it exceeds 80. For each iteration a simple blob finding algorithm identifies all disjoint seed regions greater than 400 pixels. These regions are modelled and iteratively grown and refined. The model for each seed region consists of an area (or number of data points), a mean ($\mu \in R^6$) and a positive semi-definite covariance matrix ($C$, a 6 by 6 matrix). These are obtained using geometric moments on the 6 window measurements for all window placements in the seed region. The flood fill algorithm uses the Mahalanobis distance with respect to this model to decide whether spatially neighbouring data points should be included. The Mahalanobis threshold is dynamic, being inhibited by the spread of the object in the feature space. The threshold used is $\max\{5.5 - 0.1\sigma_{max}, 0\}$ where $\sigma_{max}$ is the largest window measurement standard deviation as obtained from the covariance matrix, i.e. $\sigma_{max} = \max\{\sqrt{C_{ii}} : i = 1 \ldots 4\}$. This inhibition prevents regions with gradual change from leaking into adjacent regions. The continual model refinement also ensures that the characteristics of the new data points agree with the entire region. This process takes approximately 2 minutes on a single R10000 processor, although the time varies between images based on the number of identified regions.⁶ The final region models and a quad-tree characterising shape are stored to disk.
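The dynamic threshold described above is simple to state in code; a small sketch, with the first four model dimensions taken to be the window measurements as in Section 4.3:

```python
import numpy as np

def mahalanobis_threshold(cov):
    """Dynamic acceptance threshold max{5.5 - 0.1*sigma_max, 0}, where
    sigma_max is the largest window-measurement standard deviation
    obtained from the 6 x 6 model covariance matrix."""
    sigma_max = np.sqrt(np.diag(cov)[:4]).max()
    return max(5.5 - 0.1 * sigma_max, 0.0)
```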
4.4. Higher Level Processing

Using the model for each region ($\mu$, $C$), the expected value of the window measurements at a restricted $(x, y)$ position can be determined. For example, in the trivial case, the expected intensity and colour at the region's centroid are obtained directly from the mean. Currently the higher level processing examines each pair of regions (A and B). These two are assigned the same label if the model for A can be used to predict the characteristics at the mean of B and vice versa.
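One way to realise this prediction, sketched below under the assumption that the model is treated as a Gaussian over the six dimensions of Section 4.3 (four window measurements plus position), is the conditional mean of the measurements given a spatial position; the tolerance is a hypothetical parameter:

```python
import numpy as np

def predict_at(mean, cov, pos):
    """Expected window measurements at spatial position `pos`, via the
    Gaussian conditional mean (dimensions 0-3 are measurements,
    dimensions 4-5 are x and y)."""
    m, s = slice(0, 4), slice(4, 6)
    return mean[m] + cov[m, s] @ np.linalg.solve(cov[s, s], pos - mean[s])

def mutually_predict(a, b, tol=1.0):
    """Assign regions a and b (each a (mean, cov) pair) the same label
    if each model predicts the other's mean measurements within `tol`."""
    pa = predict_at(*a, b[0][4:6])      # A's prediction at B's centroid
    pb = predict_at(*b, a[0][4:6])      # B's prediction at A's centroid
    return (np.linalg.norm(pa - b[0][:4]) < tol and
            np.linalg.norm(pb - a[0][:4]) < tol)
```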
5. Results

The results of the segmentation algorithm are illustrated on 1000 Corel Photo CD images, including those shown in Figure 4. These images are also used in [5][6]. For each image the progressive identification of regions is available online [10].
⁵ Obtained from the orientation of the major axis in Figure 3.
⁶ The model refinement after the addition of each feature vector takes the majority of this time, even though running sums and sums of squares are maintained during the flood fill.
Fig. 4: Corel images classed as airshows, fields, deserts, polar bears and brown bears.
Four example results are shown in Figure 5. The original image is shown on the left, the segmentation in the middle and the result after high level processing on the right. As outlined previously, the intention is to use low level features to form simple spatially disjoint regions which can in turn be combined into larger or higher level objects. Hence it is preferable to identify smaller regions rather than risk joining two regions into one. The segmentation of the bald eagle produces small unlabelled regions at the edges of major homogeneous regions (shown in white in Figure 5). This phenomenon is desirable, as outlined in Section 4.2. In the final results, parts of the bird's wing and parts of the bird's neck have been combined, as well as the water in the background. This is a good example of the effectiveness of both the region's model and the prediction method. In this case, although the characteristics of the water change from the bottom to the top, the region's model captures this change, allowing the three regions to be combined. The segmentation of the polar bear image produces three sizeable regions corresponding to the snow, shrubs and polar bear. The rock image produces four major segments. Small lighting variations cause the rock to be split into two regions. Remodelling and improved high level prediction methods should be able to combine these. More importantly, the shading of the sky varies non-linearly and hence is not combined into a single region.
This illustrates the limitation of modelling the region's characteristics using up to second order geometric moments.
(a) Original
(b) Initial segmentation
(c) Final segments
Fig. 5: Results for images from the classes bald eagle, polar bear, desert and cheetah.
The final image illustrates a more fundamental limitation of the current algorithm. The final result obtains regions for the background, foreground, shadow and three regions of the cheetah (combined from four regions). In this image the cheetah is not recognised as a single homogeneous region, although it does appear so to a human. Using a larger resolution (i.e. n > 6) and hence multiresolution analysis is one way to overcome this problem. Alternatively, noting the presence of a large unlabelled region, it is reasonable to model it as a mixture of two or more overlapping regions. Such modelling would also naturally identify a tree's leaves against a blue sky; this has been left for future work.
6. Classification

As is clear from Figure 4, the classification of images into their supposed classes⁷ is highly non-trivial. Showing a graduate student all 1000 images in random order yielded classification results of 94.4%, despite results of 100% for airshows, polar bears, brown bears, elephants, tigers, cheetahs and bald eagles. These results are clearly influenced by sophisticated higher level processing and human knowledge. This demonstrates that the image classes themselves are subjective and consequently extremely high classification results cannot be expected or even justified. Currently an extremely simple classification method is used. Each image is modelled by identifying the homogeneous region closest to the center of the image which is sizeable (more than 7500 pixels). This region forms a single feature vector which includes average intensity, intensity deviation, average colour and spatial spread ($\sigma_x$, $\sigma_y$). Classification using k-nearest neighbours with k = 3 and a weighted Euclidean distance produces results of 50.5% as shown in Table 1. Using k = 1 yields 55.1%.

Image class        AS  PB  BB   E   T   C  BE  CR   F   D   S
Airshows           79   6   0   2   1   0   1   1   6   3   1
Polar Bears⁷       37  40   0   0   6   0   3   0   9   6   0
Brown Bears⁷        2   0  57   8  12   0   9   0   5   8   0
Elephants           4   2  11  63   5   2   2   0   2   9   0
Tigers              0   1  26  12  29   2   0   3  14  13   0
Cheetahs            3   1   9   9   6  49   0   0   7  16   0
Bald Eagles         8   2  13   5   3   0  56   1   3   9   0
Canadian Rockies   11   1   0   6   5   0   5  35  20  16   1
Fields              8   2   2   2   4   2   1   3  61  14   1
Deserts             8   3   7   6  11   6   2   1  19  35   2
Sunsets             5   3   3  13   4   3   4   6   5   7  47

Table 1: Classification results for 1000 Corel Photo CD images (rows give the percentage of each class assigned to each label).

⁷ The classes are assigned based on the Photo CD title. The "Bears" class has been split into the first 35 images (polar bears) and the remaining 65 (brown bears). For all other classes all 100 images are used.
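For reference, a minimal sketch of this classifier (Python/numpy, not the authors' code); the per-image feature vectors and the weight vector, which stands in for the scaling actually used, are assumed inputs:

```python
import numpy as np

def knn_classify(train_x, train_y, query, k=3, weights=None):
    """Weighted-Euclidean k-nearest-neighbour classification of a single
    per-image feature vector against a labelled training set."""
    w = np.ones(train_x.shape[1]) if weights is None else np.asarray(weights)
    d = np.linalg.norm((train_x - query) * w, axis=1)   # weighted distances
    votes = np.asarray(train_y)[np.argsort(d)[:k]]      # k nearest labels
    classes, counts = np.unique(votes, return_counts=True)
    return classes[counts.argmax()]                     # majority vote
```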
Classification results in [5][6] also fall between 50% and 55%, although inappropriate images are selectively omitted or rotated.
7. Conclusion

The segmentation algorithm presented here is general, providing a good framework for CBIR of natural images. The algorithm is adaptable to image resolution and to local features or window measurements. Although the segmentation and classification results are promising, more work is needed to effectively recognise and describe the objects or contents of the image. Multiresolution analysis and the use of mixture models are suggested to improve segmentation.
References

[1] D.A. Forsyth and M. Fleck. Body plans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, June 1997.
[2] D.A. Forsyth, M.M. Fleck and C. Bregler. Finding naked people. In Proceedings of ECCV, 1996.
[3] J. Ashley, R. Barber, M. Flickner, J. Hafner, D. Lee, W. Niblack, and D. Petkovic. Automatic and semi-automatic methods for image annotation and retrieval in QBIC. Technical Report RJ 9951-87910, IBM Almaden Research Center, April 1995.
[4] P. Aigrain, H. Zhang, and D. Petkovic. Content-based representation and retrieval of visual media: A state-of-the-art review. In Multimedia Tools and Applications, volume 3, pages 179-202. Kluwer Academic Publishers, 1996.
[5] H. Greenspan, S. Belongie, C. Carson and J. Malik. Recognition of images in large databases using color and texture. In CVPR '97, 1997.
[6] H. Greenspan, S. Belongie, C. Carson and J. Malik. Recognition of images in large databases using a learning framework. Technical Report 939, Computer Vision Group, UC Berkeley, 1997.
[7] S. Belongie. Demonstration of the windowed image second moment matrix. Available online at "http://http.cs.berkeley.edu/~sjb/smm.html/", 1996.
[8] P.S. Williams and M.D. Alder. Generic flaw detection within images. In Proceedings of ICNN 95, volume 1, pages 141-145, Perth, November 1995.
[9] P.S. Williams and M.D. Alder. Generic texture analysis applied to newspaper segmentation. In Proceedings of ICNN 96, volume 3, pages 1664-1669, Washington DC, June 1996.
[10] P.S. Williams. Corel images and segmentation results. Available online via "http://ciips.ee.uwa.edu.au/~williams/research/segmentation/", 1997.
[11] P.S. Williams and M.D. Alder. Image segmentation using an entropic UpWrite. Technical Report TR97-03, CIIPS, Department of Electrical and Electronic Engineering, The University of Western Australia, http://ciips.ee.uwa.edu.au/Papers/, 1996.
[12] P.S. Williams. The complete picture: An online demonstration. Available online at "http://ciips.ee.uwa.edu.au/~williams/research/complete1995/", 1995.
[13] R.S. Gray. Content-based image retrieval: color and edges. Technical Report 252, Dartmouth College Computer Science Department, March 1995.
[14] R. Jackson, L. MacDonald and K. Freeman. Computer Generated Color. John Wiley and Sons, 1994.