Unsupervised Multiscale Image Segmentation

Alvin H. Kam and William J. Fitzgerald
Signal Processing Laboratory, University of Cambridge Engineering Department
Trumpington Street, Cambridge CB2 1PZ, United Kingdom
ahswk2, [email protected]

Abstract

We propose a general unsupervised multiscale feature-based approach towards image segmentation. Clusters in the feature space are assumed to be properties of underlying classes, the recovery of which is achieved by the use of the mean shift procedure, a robust non-parametric decomposition method. The subsequent classification procedure consists of Bayesian multiscale processing which models the inherent uncertainty in the joint class and position domains via a Multiscale Random Field model. At every scale, the segmentation map and model parameters are estimated by sampling using Markov chain Monte Carlo simulations. The method is applied to perform colour and texture segmentation with good results.
1. Introduction

The segmentation of an observed image into an unknown number of distinct and in some way homogeneous regions is a difficult problem and remains a fundamental issue in low-level image analysis. Many different methodologies have been proposed, but a process that is highly unsupervised, robust and flexible has yet to be realised. In this paper, we propose a general unsupervised multiscale approach for feature-based image segmentation. The method is unsupervised in that the number of classes is not known a priori, and general in the sense of being independent of the feature extraction process. Consequently, the algorithm can be applied to perform different types of segmentation without modification, be it grey-scale, texture or colour based. The only requirement is for the input image to be representable in a suitable n-dimensional Euclidean feature space. Significant features, which correspond to clusters in this space, are assumed to be representations of underlying classes, the recovery of which is to be achieved using the
mean shift procedure [4], a kernel-based non-parametric decomposition method. The next step in the algorithm involves the problem of classification. Classification is a tricky issue in image segmentation problems as there is an inherent uncertainty in the simultaneous specification of class and location. This has been shown to be a consequence of the relationship between the signals of which images are composed and the symbolic descriptions, in terms of classes and properties, which are the output of the segmentation process [12]. These effects of uncertainty can be minimised by the use of representations employing multiple scales.
2. The Mean Shift Procedure

Let {X_i}, i = 1...n, be the set of n data points in a d-dimensional Euclidean space R^d. The multivariate kernel density estimate obtained with kernel K(x) and window radius h, computed at point x, is defined as:

$$\hat{f}(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) \tag{1}$$
The optimum kernel, yielding minimum integrated square error, is the Epanechnikov kernel:

$$K_E(x) = \begin{cases} \frac{1}{2} c_d^{-1} (d+2)\left(1 - x^T x\right) & \text{if } x^T x < 1 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
where c_d is the volume of a unit d-dimensional sphere [10]. The use of a differentiable kernel allows one to define the estimate of the density gradient as the gradient of the kernel density estimate:

$$\hat{\nabla} f(x) \equiv \nabla \hat{f}(x) = \frac{1}{nh^d} \sum_{i=1}^{n} \nabla K\left(\frac{x - X_i}{h}\right) \tag{3}$$
For the Epanechnikov kernel (2), the density gradient estimate in (3) becomes:

$$\nabla \hat{f}(x) = \frac{n_x}{n(h^d c_d)} \frac{d+2}{h^2} \left[ \frac{1}{n_x} \sum_{X_i \in S_h(x)} (X_i - x) \right] \tag{4}$$

where the region S_h(x) is a hypersphere (uniform kernel) of radius h, having the volume h^d c_d, centred on x and containing n_x data points. The term in the square brackets in (4):

$$M_h(x) = \frac{1}{n_x} \sum_{X_i \in S_h(x)} (X_i - x) \tag{5}$$
is called the sample mean shift, the basis of the mean shift procedure. The quantity n_x/(n(h^d c_d)) is simply the kernel density estimate f̂(x) computed with the uniform kernel S_h(x), and we can hence rewrite (4) as:

$$\nabla \hat{f}(x) = \hat{f}(x)\, \frac{d+2}{h^2}\, M_h(x) \tag{6}$$
which yields:

$$M_h(x) = k \times \nabla \hat{f}(x) \tag{7}$$

where the proportionality constant is $k = \frac{h^2}{(d+2)\hat{f}(x)}$.
Equation (7) shows that the mean shift vector is proportional to the gradient of the density estimate at the point at which it is computed. As the vector points in the direction of maximum increase in density, it can define a path leading to a local density maximum (i.e. to a mode of the density). It also exhibits a desirable adaptive behaviour: the mean shift step is large in low density regions and decreases as x approaches a mode. The mean shift procedure consists of the successive computation of the mean shift vector M_h(x) and translation of the window S_h(x) by M_h(x). Each data point thus becomes associated with a point of convergence which represents a local mode of the density in the d-dimensional space. The conventional mean shift procedure has a complexity of O(n²) for a set of n data points, which is undesirable for practical applications on images. A more realistic approach is proposed in [3], with complexity O(mn), where m ≪ n. This algorithm automatically determines the number and location of cluster centres by working on a reduced set of points randomly selected from the data and obeying certain constraints. The final decomposition depends on the selected window radius h, which is controlled by top-down a priori information about the task.
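To make the procedure concrete, the following Python sketch implements the basic O(n²) procedure described above with the uniform kernel; it is illustrative only and omits the reduced-set speed-up of [3], and the tolerance and iteration cap are our own choices:

```python
import numpy as np

def mean_shift_modes(points, h, tol=1e-5, max_iter=100):
    """Associate every data point with a local density mode by
    repeatedly applying the sample mean shift of equation (5)
    with a uniform kernel S_h(x)."""
    points = np.asarray(points, dtype=float)
    modes = points.copy()
    for idx in range(len(modes)):
        x = modes[idx]
        for _ in range(max_iter):
            # S_h(x): data points inside the hypersphere of radius h
            inside = points[np.linalg.norm(points - x, axis=1) < h]
            if len(inside) == 0:                 # guard: empty window
                break
            shift = inside.mean(axis=0) - x      # M_h(x), equation (5)
            x = x + shift                        # translate the window by M_h(x)
            if np.linalg.norm(shift) < tol:      # converged to a mode
                break
        modes[idx] = x
    return modes
```

Points whose convergence locations coincide to within a small fraction of h can then be grouped, each group yielding one candidate cluster centre.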
3. The Multiscale Random Field Model

We shall assume the validated cluster centres from the mean shift procedure to be manifestations of underlying class properties for our image segmentation task. We proceed with a supervised multiscale Bayesian classification algorithm using the Multiscale Random Field (MSRF) model [1]. In this model, let the random field Y be the image that must be segmented into regions of distinct statistical behaviour. The behaviour of each observed pixel is dependent on a corresponding unobserved label site in X. The dependence of observed sites on their labels is specified through the probability p(Y = y|X = x), the likelihood function. Prior knowledge about the size and shapes of regions is modelled by the prior distribution p(X).

X is modelled by a pyramid-structured multiscale random field. X^(0) is assumed to be the finest scale random field, with each site corresponding to a single image pixel. Each site at the next coarser scale, X^(1), corresponds to a group of four sites in X^(0), and so on for successively coarser scales. Thus, in general, the multiscale segmentation labelling is denoted by the set of random fields X^(n), n = 0, 1, 2, .... The main assumption made is that the random fields form a Markov chain from coarse to fine scale, that is:

$$p\left(X^{(n)} = x^{(n)} \mid X^{(l)} = x^{(l)},\ l > n\right) = p\left(X^{(n)} = x^{(n)} \mid X^{(n+1)} = x^{(n+1)}\right) \tag{8}$$
In other words, it is assumed that, for X^(n), the next coarser field X^(n+1) contains all the relevant information from the previous coarser scales. It shall also be assumed that the labelling of sites at a particular scale is dependent only on the labelling of a local neighbourhood at the next coarser scale. This relationship and the chosen neighbourhood structure are depicted in Figure 1.
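As a small illustration of the pyramid indexing, the following sketch maps a fine-scale site to its parent and to the parent's neighbours; the choice of the three neighbours nearest the child's quadrant is our reading of Figure 1, not a specification taken from the paper:

```python
def coarse_neighbourhood(r, c):
    """Return the coarse-scale sites a fine site (r, c) depends on:
    its parent, plus the parent's horizontal, vertical and diagonal
    neighbours nearest the child's quadrant (our reading of Figure 1).
    Boundary sites would additionally need clamping to the coarse grid."""
    pr, pc = r // 2, c // 2              # parent: a 2x2 child block shares one site
    dr = 1 if r % 2 else -1              # lower or upper half of the parent block
    dc = 1 if c % 2 else -1              # right or left half of the parent block
    return [(pr, pc),                    # delta_s0: parent
            (pr, pc + dc),               # delta_s1: horizontal neighbour
            (pr + dr, pc),               # delta_s2: vertical neighbour
            (pr + dr, pc + dc)]          # delta_s3: diagonal neighbour
```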
3.1. Sequential MAP Estimation

To minimise the average cost of an erroneous segmentation, one has to solve the following optimisation problem:

$$\hat{x} = \arg\min_{x} E\left(C(X, x) \mid Y = y\right) \tag{9}$$
where C(X, x) is the cost of estimating the 'true' segmentation X by the approximate segmentation x. The cost function should ideally assign progressively greater cost to segmentations with larger regions of misclassified pixels. To achieve this goal, the following cost function has been proposed [1]:

$$C_{SMAP}(X, x) = \frac{1}{2} + \sum_{n=0}^{L} 2^{n-1} C_n(X, x) \tag{10}$$

where

$$C_n(X, x) = 1 - \prod_{i=n}^{L} \delta\left(X^{(i)} - x^{(i)}\right)$$

and L is the coarsest scale of the multiscale pyramid.

Figure 1. Blocks 1, 2, 3 and 4 are of scale n and have a common parent at scale n + 1 on which they are dependent. The arrows show additional dependence on their parent's neighbours.

Using the MSRF model and ignoring the contribution of a second-order error term, the solution is given by the following recursion:

$$\hat{x}^{(n)} = \begin{cases} \arg\max_{x^{(L)}} p\left(Y = y \mid X^{(L)} = x^{(L)}\right) p\left(X^{(L)} = x^{(L)}\right) & \text{for } n = L \\[4pt] \arg\max_{x^{(n)}} p\left(Y = y \mid X^{(n)} = x^{(n)}\right) p\left(X^{(n)} = x^{(n)} \mid X^{(n+1)} = \hat{x}^{(n+1)}\right) & \text{for } n < L \end{cases} \tag{11}$$

The algorithm is initialised by determining the maximum a posteriori (MAP) estimate of the coarsest scale field given the image Y. The MAP segmentation at each finer scale, x̂^(n), is then found by computing the MAP estimate of X^(n) given x̂^(n+1) and the image Y, hence the name sequential MAP estimator. For our experiments, we assumed a uniform prior for X^(L), but in general any suitable prior may be used.

3.2. Likelihood and Prior Probability Functions

We will assume that at a particular scale, the observed sites are conditionally independent given their class labels:

$$p\left(Y = y \mid X^{(n)} = x^{(n)}\right) = \prod_{s \in S^{(n)}} p\left(Y_s = y_s \mid X_s^{(n)} = x_s^{(n)}\right) \tag{12}$$

where the index s denotes individual sites at scale n, y_s represents the 'averaged' feature vector of observed site Y_s, and the segmentation labels x_s^(n) take values from Λ = {1, 2, ..., c}, where c is the total number of classes. We choose to model p(Y_s = y_s | X_s^(n) = x_s^(n)) as a single-sided Gaussian distribution function:

$$p\left(Y_s = y_s \mid X_s^{(n)} = x_s^{(n)}\right) = \frac{2}{\sqrt{2\pi\sigma_n^2}} \exp\left(-\frac{1}{2\sigma_n^2}\left[\Delta_{y_s, x_s^{(n)}}\right]^2\right) \tag{13}$$

where Δ_{y_s, x_s^(n)} represents the dissimilarity measure (Euclidean distance) between the feature vector y_s and the representative feature vector of label x_s^(n). The variance parameter σ_n typically increases with segmentation resolution, which agrees with the increased class uncertainty at finer scales. From our assumptions on the class labels X:

$$p\left(X^{(n)} = x^{(n)} \mid X^{(n+1)} = \hat{x}^{(n+1)}\right) = \prod_{s \in S^{(n)}} p\left(X_s^{(n)} = x_s^{(n)} \mid X_{\delta s}^{(n+1)} = \hat{x}_{\delta s}^{(n+1)}\right) \tag{14}$$

where δs denotes the neighbourhood structure shown in Figure 1. We choose the following form for the right-hand-side term above:

$$p\left(X_s^{(n)} = m \mid X_{\delta s_0}^{(n+1)} = i,\ X_{\delta s_1}^{(n+1)} = j,\ X_{\delta s_2}^{(n+1)} = k,\ X_{\delta s_3}^{(n+1)} = l\right) = \frac{\alpha_n}{9}\left(3\delta_{m,i} + 2\delta_{m,j} + 2\delta_{m,k} + 2\delta_{m,l}\right) + \frac{1 - \alpha_n}{c} \tag{15}$$

where δ_{m,n} represents the unit delta function. The scale-dependent parameter α_n ∈ [0, 1] determines the probability that the class label of a fine scale site remains the same as that of one of its coarser scale local neighbourhood. Conversely, 1 − α_n is the probability that a new class label will be randomly chosen from the remaining classes.
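A minimal sketch of the transition probability (15), assuming the coarse labels are passed in the order parent (δs0) followed by the three neighbours (δs1, δs2, δs3):

```python
def label_prior(m, i, j, k, l, alpha_n, c):
    """Equation (15): probability that a fine-scale site takes label m,
    given coarse labels i (parent) and j, k, l (parent's neighbours).
    The weights 3, 2, 2, 2 sum to 9, so the copy term is normalised."""
    delta = lambda a, b: 1.0 if a == b else 0.0
    copy_term = (alpha_n / 9.0) * (3 * delta(m, i) + 2 * delta(m, j)
                                   + 2 * delta(m, k) + 2 * delta(m, l))
    return copy_term + (1.0 - alpha_n) / c
```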
4. Parameter Estimation

As the model parameters σ_n and α_n are generally unknown, they have to be estimated before any segmentation at a particular scale can be performed. Unfortunately, these parameters themselves depend on the segmentation to be performed. To circumvent this 'chicken and egg' problem, the Metropolis-Hastings algorithm [6], [9], a Markov chain Monte Carlo (MCMC) sampling procedure, is used in a predetermined sequential scan to update the segmentation map and the model parameters in turn. The stochastic relaxation process of simulated annealing [5] is also used.
4.1. Conditional Distributions

The conditional distribution of the class label, from equations (12) and (14), is given by:

$$p\left(X^{(n)} = x^{(n)} \mid X^{(n+1)} = \hat{x}^{(n+1)}, Y = y, \sigma_n, \alpha_n\right) \propto \prod_{s \in S^{(n)}} p\left(Y_s = y_s \mid X_s^{(n)} = x_s^{(n)}, \sigma_n\right) p\left(X_s^{(n)} = x_s^{(n)} \mid X_{\delta s}^{(n+1)} = \hat{x}_{\delta s}^{(n+1)}, \alpha_n\right) \tag{16}$$
where the distributions for the likelihood and prior terms are given by equations (13) and (15) respectively. For the sampling of σ_n and α_n, the respective conditional distributions are:

$$p\left(\sigma_n \mid X^{(n)} = x^{(n)}, Y = y\right) \propto \left[\prod_{s \in S^{(n)}} p\left(Y_s = y_s \mid X_s^{(n)} = x_s^{(n)}, \sigma_n\right)\right] p\left(\sigma_n \mid X_s^{(n)} = x_s^{(n)}\right) \tag{17}$$

$$p\left(\alpha_n \mid X^{(n)} = x^{(n)}, X^{(n+1)} = \hat{x}^{(n+1)}\right) \propto \left[\prod_{s \in S^{(n)}} p\left(X_s^{(n)} = x_s^{(n)} \mid X_{\delta s}^{(n+1)} = \hat{x}_{\delta s}^{(n+1)}, \alpha_n\right)\right] p\left(\alpha_n \mid X_{\delta s}^{(n+1)} = \hat{x}_{\delta s}^{(n+1)}\right) \tag{18}$$
with the likelihood terms given by equations (13) and (15) for σ_n and α_n respectively. Non-informative or reference priors [8] were used for all experiments. The computational overhead for the estimation of σ_n and α_n can be greatly reduced by subsampling sites at high resolutions. Since the number of sites at high resolutions is large, this subsampling procedure has negligible effect on the accuracy of the estimated parameters.
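Putting the pieces together, the per-scale scan might be organised as in the sketch below. The helper names (`initialise_from`, `sample_sigma`, `sample_alpha`, `sample_labels`) are hypothetical stand-ins for Metropolis-Hastings moves on the conditionals (17), (18) and (16) respectively:

```python
def segment_scale(y, x_coarse, n_sweeps=50):
    """One scale of the sequential scan: alternately sample the model
    parameters and the label field from their conditional distributions.
    All helpers here are hypothetical Metropolis-Hastings moves."""
    x = initialise_from(x_coarse)        # hypothetical: project coarse labels down
    sigma_n, alpha_n = 1.0, 0.5          # illustrative starting values
    for _ in range(n_sweeps):
        sigma_n = sample_sigma(sigma_n, x, y)                # conditional (17)
        alpha_n = sample_alpha(alpha_n, x, x_coarse)         # conditional (18)
        x = sample_labels(x, x_coarse, y, sigma_n, alpha_n)  # conditional (16)
    return x
```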
4.2. Metropolis-Hastings Algorithm

The conditional distributions of the segmentation map and the model parameters are difficult functions to maximise because they are multimodal and the vast combined parameter space is composed of both continuous and discrete subspaces. The Metropolis-Hastings algorithm is a robust MCMC optimisation algorithm ideally suited to this type of problem. Let the probability density function (to be maximised) be given in the functional form y = p(x), where x ∈ Φ. The algorithm explores the parameter space Φ by means of
a random walk. Let X_i be the i-th element of such a random walk. The next variate in the random sequence is proposed to be Y_i, produced by adding a random perturbation ζ to X_i, where ζ is drawn from the proposal density s(ζ). There are two possibilities for the choice of the next variate in the random sequence: (i) X_{i+1} = Y_i, i.e. the proposed variate is accepted, or (ii) X_{i+1} = X_i, i.e. the proposed variate is rejected. The probability of accepting Y_i instead of the current value is given by the Metropolis-Hastings acceptance function:
$$A(X_i, Y_i) = \min\left(1, Q(X_i, Y_i)\right) \tag{19}$$

where:

$$Q(X_i, Y_i) = \frac{p(Y_i)\,T(Y_i \mid X_i)}{p(X_i)\,T(X_i \mid Y_i)} \tag{20}$$
The conditional probability T(Y_i|X_i) is identical to s(ζ) and hence T(X_i|Y_i) is given by s(−ζ). Let ξ be a uniform random variate drawn over the range [0, 1]. If the condition Q(X_i, Y_i) > ξ holds, then the next term in the random sequence is X_{i+1} = Y_i; otherwise X_{i+1} = X_i. If the proposal density is symmetrical about the origin, the probability of acceptance is independent of s(ζ). Thus, a position of lower probability will always be exchanged for one of higher probability, but a transition in the reverse direction depends on a favourable outcome of a random event simulated using a uniform random variate. This allows the algorithm to escape from local maxima while striving for the global maximum of p(x). Simulated annealing is readily incorporated into the algorithm by modifying the Metropolis-Hastings acceptance ratio (20) such that:
$$Q_{SA}(X_i, Y_i) = \left[\frac{p(Y_i)\,T(Y_i \mid X_i)}{p(X_i)\,T(X_i \mid Y_i)}\right]^{1/T_t} \tag{21}$$
where T_t is the annealing temperature at iteration t of the algorithm. At initial high temperatures the probability of acceptance is very high, but it decreases with the gradual cooling of the annealing temperature, so that the global maximum is reached at very low temperatures. There has been much debate about how convergence relates to the annealing schedule used. Theoretically, the logarithmic schedule derived in [5] is guaranteed to converge, but only in infinite time. In practice this is not implementable, and we have adopted a linear schedule which produces fairly robust convergence in a relatively short time. The complete algorithm is applied to perform the challenging tasks of colour and texture segmentation.
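A generic sketch of the resulting sampler, combining the random-walk proposal, the annealed acceptance ratio (21) and a linear cooling schedule; the proposal scale and temperature endpoints are illustrative choices, not values from the paper:

```python
import numpy as np

def metropolis_annealed(log_p, x0, n_iter=5000, step=0.1,
                        t_start=10.0, t_end=0.05):
    """Random-walk Metropolis with a symmetric Gaussian proposal, so the
    T(.|.) terms cancel in Q, and the ratio is raised to 1/T_t as in (21)."""
    x = x0
    rng = np.random.default_rng()
    for t in range(n_iter):
        temp = t_start + (t_end - t_start) * t / (n_iter - 1)  # linear schedule
        y = x + rng.normal(scale=step, size=np.shape(x))       # proposal Y_i
        log_q = (log_p(y) - log_p(x)) / temp                   # annealed ratio
        if np.log(rng.uniform()) < log_q:                      # accept/reject
            x = y
    return x
```

Working in the log domain avoids numerical underflow; the acceptance test `log(xi) < log Q` is equivalent to `xi < Q` and automatically accepts whenever Q ≥ 1.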
5. Colour Segmentation

Colour is a useful property which adds significant information about the physical properties of objects and aids tremendously in the segmentation of images. Colour correlates with the class identity of an object because pigments form part of an object's appearance and thus provide vital cues for identification, classification and segmentation purposes. The perceptually uniform CIE L*a*b* space is used to represent colour features. It is generated by linearly transforming the RGB colour space to the XYZ colour space, followed by a non-linear transformation.

Figure 2 shows some colour segmentation results. For the 'hand' image, the algorithm is able to easily distinguish the human hand and the blue doughnut-like object from the textured background. For the more difficult 'jet' image, the toy jet-plane, its shadow and the background are picked out by the algorithm despite the considerable colour variability of each object. Segmentation of the 'parrot' image reveals a meaningful partitioning of the image. Generally, the well-defined contours reflect the excellent boundary tracking ability of the algorithm, while smooth regions of homogeneous behaviour are the result of the multiscale processing approach.
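A sketch of the colour feature computation under the stated two-step scheme; the sRGB primaries and D65 white point are assumptions on our part, as the paper does not specify which RGB space was used:

```python
import numpy as np

# Linear sRGB to XYZ matrix with a D65 white point -- an assumption,
# since the paper does not specify the RGB primaries used
RGB_TO_XYZ = np.array([[0.4124, 0.3576, 0.1805],
                       [0.2126, 0.7152, 0.0722],
                       [0.0193, 0.1192, 0.9505]])
WHITE = np.array([0.9505, 1.0000, 1.0890])   # D65 reference white

def rgb_to_lab(rgb):
    """Linear RGB in [0,1] -> CIE L*a*b*: a linear map to XYZ
    followed by the standard cube-root non-linearity."""
    xyz = RGB_TO_XYZ @ rgb / WHITE
    f = np.where(xyz > (6 / 29) ** 3,
                 np.cbrt(xyz),
                 xyz / (3 * (6 / 29) ** 2) + 4 / 29)
    L = 116 * f[1] - 16
    a = 500 * (f[0] - f[1])
    b = 200 * (f[1] - f[2])
    return np.array([L, a, b])
```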
Figure 2. Randomly selected colour segmentation results. Original images of 'hand', 'jet' and 'parrot' are on the left and the corresponding segmentations on the right.

6. Texture Segmentation

Texture is associated with intrinsic properties of surfaces, especially those that do not have a smoothly varying intensity. It includes intuitive properties like roughness, granulation and regularity, all of which provide vital cues in image segmentation problems.

Figure 3 illustrates a texture feature extraction model. The input texture image x(m, n) is filtered by h(k, l), a frequency and orientation selective filter, the output of which passes through a local energy function f(.) to produce the final feature image v(m, n). The purpose of the filter h(k, l) is to extract spatial frequencies (of a particular scale and orientation) at which one or more textures have high signal energy and the others have low energy. A quadrature mirror wavelet filter bank, used in an undecimated version of an adaptive tree-structured decomposition scheme [2], performs this task in our experiments on textures.

For the local energy function, squaring in conjunction with the logarithm after smoothing has been found to be the best operator pair for unsupervised segmentation among a set of tested operator pairs [11]. For this reason, this operator pair is used in our experiments. The Gaussian lowpass filter is ideal as the smoothing filter, with its jointly optimum resolution in the spatial and spatial-frequency domains. To determine the filter size, the spread σ_s may be set as a function of the band centre frequency f_0. With f_0 normalised (−1/2 ≤ f_0 ≤ 1/2), it has been suggested [7] that σ_s = 1/(2√2 |f_0|). This smoothing filter is also scaled to produce unity gain, so that the mean of the filter's output is identical to that of its input.

The final feature space for texture segmentation consists of two dimensions of textural features and one dimension of luminance. The textural features are orthogonal and are obtained by performing a principal component analysis on the raw wavelet features.
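The local energy path can be sketched as follows, with `filtered` standing for the output of one wavelet channel h(k, l) and the spread σ_s tied to the band centre frequency as above; the small offset before the logarithm is our own guard:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_energy(filtered, f0):
    """Turn one band-pass channel into a feature image: square,
    smooth with a unity-gain Gaussian whose spread is
    sigma_s = 1 / (2*sqrt(2)*|f0|), then take the logarithm."""
    sigma_s = 1.0 / (2.0 * np.sqrt(2.0) * abs(f0))
    energy = gaussian_filter(filtered ** 2, sigma=sigma_s)
    return np.log(energy + 1e-12)   # small offset guards against log(0)
```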
Figure 3. Block diagram of a texture feature extraction model.

Some texture segmentation results are shown in Figure 4. For the 'brodatz' image, the algorithm is able to distinguish all five textures of the Brodatz texture mosaic and produces a highly accurate segmentation map. The segmentation of an aerial image of San Francisco, 'sanfran', produces well-discriminated regions of the coastline, tilled areas, parks and the open sea, while segmentation of the SAR image, 'sar', shows remarkable preservation of detail as well as accurate boundary detection.
7. Discussion

In this paper, we have proposed a general multiscale approach for unsupervised feature-based image segmentation. The method is general owing to its independence from the feature extraction process, unsupervised in that the number of classes is not known a priori, and robust through the use of the mean shift procedure and multiscale processing. The mean shift procedure has been shown to perform well in detecting clusters in the complicated feature spaces of many real images. The Multiscale Random Field model makes effective use of the inherent trade-off between class and position uncertainty, which is evident in the excellent boundary tracking performance. The multiscale processing reduces computational costs by keeping computations local, yet produces results that reflect the global properties of the image.
Figure 4. Randomly selected texture segmentation results. Original images of ‘brodatz’, ‘sanfran’ and ‘sar’ are on the left and the corresponding segmentation on the right.
References

[1] C. Bouman and M. Shapiro. A Multiscale Random Field Model for Bayesian Image Segmentation. IEEE Trans. Image Process., 3(2):162–177, 1994.
[2] T. Chang and C. Kuo. Texture Analysis and Classification with a Tree-Structured Wavelet Transform. IEEE Trans. Image Process., 2(4):429–441, 1993.
[3] D. Comaniciu and P. Meer. Distribution Free Decomposition of Multivariate Data. 2nd Intern. Workshop on Statist. Techniques in Patt. Recog., Sydney, Australia, 1998.
[4] K. Fukunaga and L. Hostetler. The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition. IEEE Trans. Info. Theory, 21:32–40, 1975.
[5] S. Geman and D. Geman. Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images. IEEE Trans. Pattern Anal. Machine Intell., 6(6):721–741, 1984.
[6] W. Hastings. Monte Carlo Sampling Methods using Markov Chains and their Applications. Biometrika, 57:97–109, 1970.
[7] A. Jain and F. Farrokhnia. Unsupervised Texture Segmentation using Gabor Filters. Pattern Recognition, 24(12):1167–1186, 1991.
[8] H. Jeffreys. Theory of Probability. Oxford University Press, 1939.
[9] N. Metropolis, A. Rosenbluth, and M. Rosenbluth. Equation of State Calculations by Fast Computing Machines. Journal of Chem. Phys., 21:1087–1092, 1953.
[10] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.
[11] M. Unser and M. Eden. Non-linear Operators for Improving Texture Segmentation based on Features Extracted by Spatial Filtering. IEEE Trans. Syst., Man and Cyb., 20:804–815, 1990.
[12] R. Wilson and M. Spann. Image Segmentation and Uncertainty. Research Studies Press Ltd., Letchworth, Hertfordshire, U.K., 1988.