Pattern Recognition Letters 26 (2005) 1512–1524 www.elsevier.com/locate/patrec

Recognizing objects on cluttered backgrounds

Katarina Mele a,*, Jasna Maver a,b

a Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, 1001 Ljubljana, Slovenia
b Faculty of Arts, University of Ljubljana, Aškerčeva 2, 1000 Ljubljana, Slovenia

Received 1 March 2004; received in revised form 22 November 2004
Available online 5 February 2005
Communicated by E. Backer

Abstract

This paper deals with recognition of known 3-D objects in different orientations on cluttered backgrounds. As a recognition technique we apply support vector machines (SVMs). To cope with the cluttered background, a tree structure of masks is introduced for each object. SVMs are then computed by masking the training sets with the appropriate masks. One- and two-class SVMs are combined in the recognition process. One-class SVMs, used at the first stage, allow us to avoid the "non-object" class generation usually required to classify unknown objects or other parts of a scene. Two-class SVMs are further applied to resolve the recognition process when necessary. The proposed method is compared with two other approaches and, as demonstrated by experimental results, it is robust to cluttered backgrounds. The advantage of the method is its ability to classify a pattern as unknown, which has a valuable effect on the false positive rate.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Object recognition on cluttered background; Appearance-based object recognition; Support vector machines; Pattern recognition

1. Introduction

* Corresponding author. Present address: UNSW, School of Computer Science and Engineering, Anzac Parade, Sydney, NSW 2052, Australia. Tel.: +61 2 9385 6907; fax: +61 2 9385 5995. E-mail addresses: [email protected], katarinam@cse.unsw.edu.au (K. Mele), jasna.maver@ff.uni-lj.si (J. Maver).

One of the challenging problems in computer vision is to learn objects and to recognize them under different conditions. Illumination, viewing angle, cluttered background, different scale, noise, and occlusions are factors that can affect recognition performance. The objective of the work presented in this paper is to recognize known 3-D



objects in different orientations¹ on a cluttered background. The problem is posed in the framework of view-based recognition. There is an extensive body of literature on view-based recognition. View-based methods differ with regard to the type of classifier applied. Several classifiers have been proposed, e.g., minimum distance classification in the eigenspace (Nayar et al., 1996; Leonardis and Bischof, 2000; Murase and Nayar, 1995), Fisher's discriminant analysis (Belhumeur et al., 1997), neural networks (Fleming and Cottrell, 1990) and SVMs (Heisele et al., 2001; Pontil and Verri, 1998; Roobaert, 2001). The approach to recognition can be either global or component based. In the case of the global approach, a classifier is first trained on a set of feature vectors that represent whole object images acquired from different viewing directions. Such an approach lacks robustness to occlusions, clutter and noise. An alternative to the global approach is to split the global feature vector into components and to base the recognition on the components. Here we narrow our focus to the works closely related to our approach. In the work of Rizvi and Nasrabadi (2002) a region-based principal component analysis is applied. Targets are divided into four groups based on their sizes and shapes when viewed from different angles. Regions of interest are extracted with the use of representative silhouettes generated for each group. Each image in a group is also divided into several regions and a PCA is performed for each region to extract feature vectors. These feature vectors are then used to decide whether a potential target is clutter or a real target. To manage the problem of cluttered background, the idea of robust PCA as proposed by Leonardis and Bischof (2000) can also be applied. The robustness of their recognition process lies in the way the parameters representing the data are determined. Instead of computing the parameters by a projection of the data onto the eigenimages, they extract them by a robust hypothesize-and-test paradigm using subsets of image points. Many hypotheses are generated on randomly chosen subsets of image points. Competing hypotheses are then subject to a selection procedure based on the Minimum Description Length principle.

¹ One-dimensional rotation in 3-D.


The method proposed in this paper performs recognition by support vector machines (SVMs). In (Pontil and Verri, 1998) the authors study SVMs on images of 3-D objects from the COIL-100 database. Tests were also performed on images corrupted by synthetically generated noise and a moderate amount of occlusion, and on test images with a bias in the registration. The good recognition rates achieved in all performed experiments indicate that SVMs are well suited for view-based recognition. In (Roobaert, 2001) the author applies SVMs for recognition of objects on cluttered backgrounds. The work relies on pedagogical learning: special training examples of backgrounds are proposed to overcome the effect of a cluttered background. We adapted this BW method to color images. In our work we cope with the clutter by introducing a tree structure of masks for each object. One- and two-class SVMs are combined in the recognition process. One-class SVMs, used at the first stage, allow us to avoid the "non-object" class generation usually required to classify unknown objects or other parts of a scene. Two-class SVMs are further applied to resolve the recognition process when necessary. As demonstrated by experimental results, the proposed method is robust to cluttered backgrounds. The advantage of the method is a small false positive rate. This paper is organized as follows: In Section 2 we give a short theoretical background of SVMs. In Section 3 we pose the problem and propose the tree structures of masks used in the process of learning of one- and two-class SVM hierarchies. Section 4 illustrates experimental results obtained on two different image databases. The proposed hierarchical principle of SVMs is compared with two other methods. At the end we give conclusions and some suggestions for future work.

2. Theoretical background

2.1. Support vector machine (SVM)

SVM is a classification technique convenient for classifying high-dimensional data and hence suitable for image classification. An image X_i of size p × r = n can be represented as a point x_i in


R^n. Let {x_i} be a set of points belonging to two different classes and let {y_i}, y_i ∈ {−1, 1}, be their class labels. The SVM performs binary classification by an optimal linear discriminant function f(x) = w · x + b. The parameters w and b are determined from empirical data, i.e., the pairs {(x_i, y_i)} form a training set. The SVM is defined as the function that separates the two classes by a hyper-plane such that the distance to the closest data points, i.e., the margin, is maximized. These closest points to the hyper-plane are called support vectors, and the resulting hyper-plane is called the optimal separating hyper-plane (OSH). The linear SVM is given as

f(x) = \sum_i a_i y_i (x_i \cdot x) + b.    (1)

The coefficients {a_i} are determined by solving a quadratic programming problem (Vapnik, 1998). The value of b is computed, given the {a_i}, as

b = -\frac{\max_{y_i = -1} \sum_j a_j y_j (x_i \cdot x_j) + \min_{y_i = 1} \sum_j a_j y_j (x_i \cdot x_j)}{2}.
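For illustration, Eq. (1) and the above expression for b can be evaluated as in the following NumPy sketch (ours, not the authors' code); it assumes the dual coefficients {a_i} have already been obtained from a quadratic programming solver.

```python
import numpy as np

def linear_svm_decision(x, X_train, y_train, alpha, b):
    """Evaluate Eq. (1): f(x) = sum_i a_i y_i (x_i . x) + b."""
    return np.sum(alpha * y_train * (X_train @ x)) + b

def bias_from_dual(X_train, y_train, alpha):
    """Compute b from the dual coefficients as in the expression above."""
    G = X_train @ X_train.T              # Gram matrix of dot products x_i . x_j
    s = (alpha * y_train) @ G            # s[i] = sum_j a_j y_j (x_i . x_j)
    return -(s[y_train == -1].max() + s[y_train == 1].min()) / 2.0
```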

A new point x is classified in accordance with the sign of f(x) (Eq. (1)), which indicates the side of the hyper-plane. SVMs can also be applied to multi-class recognition problems (Cristianini and Taylor, 2000). To distinguish the known objects from unknown objects or other parts of a scene, a class named "non-object" has to be introduced. All images that do not represent the known objects belong to the non-object class. To train the SVM, a representative non-object training set has to be prepared, which is a difficult task. An inappropriate training set leads to a large number of false positives. To avoid the difficulty of generating the non-object training set one can use one-class SVMs, described next.

2.2. One-class SVM

A one-class SVM is a classifier which is supposed to separate the data of one class from the data of all other classes. Let {x_i} ⊂ X be data representing an object class. It is reasonable to assume that {x_i} cluster in a certain way, but data outside that class usually do not cluster since they can belong to any class. Let Φ be a feature map X → H, a map into a dot product space H such that the dot product

in the image of Φ can be computed by evaluating some simple kernel, k(x, y) = Φ(x) · Φ(y). The strategy is to map the training data {x_i} into the feature space via Φ and to construct a tight hyper-sphere with radius R and center c that describes the data in the feature space. In (Chen et al., 2001) and (Schölkopf et al., 2001) the authors formulate the problem and give the following solution. Data can be classified by the decision function:

f(x) = \operatorname{sgn}\Big( R^2 - \sum_{ij} a_i a_j k(x_i, x_j) + 2 \sum_i a_i k(x_i, x) - k(x, x) \Big).    (2)
The coefficients {a_i} are determined by solving a quadratic programming problem. R^2 is computed from Eq. (2) such that for any x_i with a_i > 0 the argument of the sgn is 0. The center of the hyper-sphere is determined as c = \sum_i a_i \Phi(x_i). The function f(x) (Eq. (2)) is positive inside the hyper-sphere and negative on the complement. We experiment only with a linear kernel.
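As a concrete illustration (a minimal sketch, not the toolbox code used in the paper), the argument of the sgn in Eq. (2) can be evaluated as follows; the coefficients {a_i} and R^2 are assumed to come from a quadratic programming solver, and the kernel defaults to the linear one used here.

```python
import numpy as np

def one_class_decision(x, X_train, alpha, R2, kernel=np.dot):
    """Evaluate Eq. (2): sgn(R^2 - sum_ij a_i a_j k(x_i,x_j) + 2 sum_i a_i k(x_i,x) - k(x,x))."""
    K = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
    k_x = np.array([kernel(xi, x) for xi in X_train])
    value = R2 - alpha @ K @ alpha + 2.0 * alpha @ k_x - kernel(x, x)
    return np.sign(value)    # positive inside the hyper-sphere, negative outside
```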

3. Learning of 3-D objects with SVMs

The approach to view-based learning of 3-D objects with SVMs as proposed in (Pontil and Verri, 1998) is as follows. The objects to be learned are individually placed on a turntable and images are taken for different orientations of the turntable. The acquired sets of images are then used to train SVMs. As we have seen in the previous section, the SVM classifier treats images as points in R^n. The location of a point is determined by object pixels as well as background pixels. In the work of Pontil and Verri (1998) only a black background is considered. A cluttered background changes the zero values of the black background and hence can move the point far away from its original location. A cluttered background therefore decreases the accuracy of the SVM classifier. To avoid the problem of cluttered background, the idea is to use only the data representing the object projection. The size and the shape of the object projection change from view to view. Consequently,


pixels at some locations in images may for some viewing directions belong to an object while for others they belong to the background. Despite the different object projections there usually exists an area of pixel locations in images of the training set that belongs only to object projections, i.e., the object only region (OOR). This region can be small and therefore insufficient for designing a reliable classifier. To also consider the other parts of the object projections, the idea is to recursively divide the training set into subsets and then to train the classifiers on the OORs of each subset. The process of training set division is controlled by two parameters. The first parameter determines how to split the set into two subsets and the second parameter determines when to stop the process of division. During the process of training set division a tree structure of masks is built. Each mask represents the OOR of the current training set and is used for image masking.

3.1. Tree structures of masks

Let {X_i} be a set of training images of one object and let us form a set of binary images {B_i} such that

B_i(c, l) = \begin{cases} 1, & \text{if } X_i(c, l) \text{ is an object pixel}, \\ 0, & \text{if } X_i(c, l) \text{ is a background pixel}, \end{cases} \qquad c = 1, \ldots, p, \; l = 1, \ldots, r.

An intersection mask I = \bigwedge_{i=1}^{N} B_i, a union mask U = \bigvee_{i=1}^{N} B_i, and their area sizes sum(I), sum(U) are computed for {B_i}. sum(·) gives the number of active pixels, i.e., pixels with value 1. The intersection mask I corresponds to the OOR and determines the mask of the first level of the tree structure. The normalized difference of the sizes of the union and intersection masks,

D = \frac{\mathrm{sum}(U) - \mathrm{sum}(I)}{\mathrm{sum}(U)},    (3)

indicates the change in object projections in images of the training set. A large change requires a division of the training set into subsets and their separate processing. Hence, if the value of D (Eq. (3)) exceeds a predefined threshold Tr, the image set is divided into two subsets and a new level of the tree structure is formed (Fig. 1).
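The masks and the criterion of Eq. (3) are straightforward to compute; the following sketch (our illustration, operating on a hypothetical stack B of binary masks) shows the computation that drives the splitting decision.

```python
import numpy as np

def division_criterion(B):
    """B: stack of binary masks, shape (N, p, r). Returns the intersection mask I
    (the OOR), the union mask U, and the normalized difference D of Eq. (3)."""
    I = np.logical_and.reduce(B)
    U = np.logical_or.reduce(B)
    D = (U.sum() - I.sum()) / U.sum()
    return I, U, D

# The set is split into two subsets whenever D exceeds the threshold Tr (0.25 in the paper).
```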


Fig. 1. Objects with a simple and a complex 3-D shape. The projection of a torus remains the same from view to view, while the projections of a pelican are significantly different for some views. Hence, in the case of the torus the tree structure has only one level. The pelican requires a three-level tree structure.

In our experiments the value of Tr was determined experimentally and was set to 0.25. It was kept constant for all objects. It is reasonable to assume that for each image in a set, one of the images acquired from the neighboring viewing angles is the most similar. The assumption of similarity relates not only to the shape of the object projection but also to the pixel values, i.e., to the color and texture of the object surface. The above assumption simplifies the problem of set division. We are seeking the division of {B_i} into two subsets {B_i, B_{i+1}, ..., B_{j-1}} and {B_j, B_{j+1}, ..., B_{i-1}} such that the sum

\frac{\mathrm{sum}(B_i \wedge B_{i+1} \wedge \cdots \wedge B_{j-1})}{\mathrm{sum}(B_i \vee B_{i+1} \vee \cdots \vee B_{j-1})} + \frac{\mathrm{sum}(B_j \wedge B_{j+1} \wedge \cdots \wedge B_{i-1})}{\mathrm{sum}(B_j \vee B_{j+1} \vee \cdots \vee B_{i-1})}    (4)

is maximal, assuming circular behavior of the indices i and j. At the same time it is required that each subset includes at least 1/3 of the images of the dividing set and at least a pre-specified minimal number of images. These additional requirements assure a good balance of the tree structure and a sufficient size of the training sets. The solution is obtained by performing all possible set divisions, i.e., n((n-1)/2 - m + 1) and n(n/2 - m + 1) - n/2 divisions for an odd and an even number of images, respectively. n represents the number of images in the dividing set and m the required minimal number of images in each subset. New intersection and union masks are computed for each subset. The computed intersection masks form level 2 of the tree structure. The quotient (3) is computed for each subset separately and, if necessary, level 3 is formed by further partitioning of the image sets.
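The exhaustive search over circular splits described above can be sketched as follows (an illustration only; the paper gives no implementation). The helper enumerates every contiguous split satisfying the size constraints and keeps the one maximizing Eq. (4).

```python
import numpy as np

def iou_ratio(B_subset):
    """Sum of the intersection mask divided by the sum of the union mask."""
    return np.logical_and.reduce(B_subset).sum() / np.logical_or.reduce(B_subset).sum()

def best_circular_split(B, m):
    """Return the start indices (i, j) of the two contiguous subsets of the circular
    mask sequence B (shape (N, p, r)) that maximize the criterion of Eq. (4),
    subject to each subset holding at least m and at least N // 3 masks."""
    N = len(B)
    min_size = max(m, N // 3)
    best_score, best_split = -np.inf, None
    for i in range(N):
        for size in range(min_size, N - min_size + 1):
            first = B[[(i + k) % N for k in range(size)]]               # {B_i, ..., B_{j-1}}
            second = B[[(i + size + k) % N for k in range(N - size)]]   # {B_j, ..., B_{i-1}}
            score = iou_ratio(first) + iou_ratio(second)
            if score > best_score:
                best_score, best_split = score, (i, (i + size) % N)
    return best_split
```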


Fig. 2. Tree structure of masks for a teapot.

Fig. 2 represents such a tree structure of masks for the image set depicted in Fig. 3.

3.2. One-class SVM hierarchies

Building the tree structure of masks includes the determination of the masks as well as the determination of the training image sets used in the process of learning at each level of the tree structure. Next, one-class SVMs are computed. For each set of masked images a decision function (2) is determined. Each set of images is represented by its own hyper-sphere. Hence, for the tree structure in Fig. 2 we get a hierarchy of seven hyper-spheres.

3.3. Two-class SVM hierarchies

We cannot expect that the masked patterns of an object cluster in a spherical shape; therefore, the hyper-spheres may also include points that do not represent the particular object. At the same

time, different shapes of masks of different objects encompass different dimensions of R^n, i.e., different information in the image. For these reasons, two or more objects can be recognized at the same place in an image. To constrain the recognition at such locations, two-class SVM hierarchies are introduced. All pairs of objects are considered in the following way: Let {c_l, c_k} and {R_l, R_k} be the centers and the radii of the hyper-spheres at the first level of the tree structures of the objects in a pair, respectively. It is reasonable to examine only the dimensions of R^n which are common to both hyper-spheres. Hence, a new intersection mask has to be computed by applying the AND operator to the masks of the first level of the tree structures. Let the new intersection mask encompass m dimensions of R^n, i.e., R^m. The hyper-spheres of both objects are projected to R^m and the projections are examined to see whether they intersect:

\sqrt{\sum_{i=1}^{m} (c_{ki} - c_{li})^2} < R_k + R_l.    (5)

If the inequality (5) is true, the object pair is critical. In this case an additional two-class SVM hierarchy has to be constructed (Fig. 4). Images of both objects are masked with the new intersection mask and a two-class SVM is determined in accordance with (1). The procedure is then repeated at higher levels of the tree structures. New intersection masks are determined, the hyper-sphere projections are examined to see whether they intersect, and if so, the two-class SVMs are computed.
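A sketch of the critical-pair test of inequality (5) is given below (our illustration; with the linear kernel the centers c = \sum_i a_i x_i are explicit vectors in R^n).

```python
import numpy as np

def is_critical_pair(mask_k, mask_l, c_k, c_l, R_k, R_l):
    """Test inequality (5) on the dimensions shared by the two first-level masks.

    mask_k, mask_l: boolean masks flattened to length n; c_k, c_l: hyper-sphere
    centers in R^n; R_k, R_l: the corresponding radii."""
    common = np.logical_and(mask_k, mask_l)          # new intersection mask -> R^m
    distance = np.linalg.norm(c_k[common] - c_l[common])
    return distance < R_k + R_l                      # True: the projections intersect
```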

Fig. 3. Acquired image set for a teapot.


Fig. 6. Hypothesis verification by two-class SVM hierarchies. The hypothesis is rejected.

Fig. 4. Separating two classes by one- and two-class SVMs.

3.4. Object recognition by SVM hierarchies

The process of recognition is performed by moving a searching window over the test image. The current window is masked with the appropriate masks and then classified as a member of the class if the masked patterns lie inside the corresponding hyper-spheres. We start with the hyper-sphere at level 1 and then continue with the higher levels. The pattern is a candidate for the known object if there exists a path in the tree structure of one-class SVMs from the root to one of the leaves which gives a consistent classification at all levels. Fig. 5 shows an example of a positive and a negative recognition test. In the case that the object forms critical pairs with other objects, the verification also has to be performed with the hierarchical structures of two-class SVMs. Let us demonstrate the procedure by an example. Assume that object Ob1 forms a critical pair with object Ob2 and object Ob3. The verification of the recognition result obtained by the one-class SVM hierarchy for Ob1 also has to be done by the two-class SVM hierarchies, i.e., Ob1 against Ob2 and Ob1 against Ob3. Fig. 6 gives an example of a recognition verification. While classifier Ob1 against Ob2 confirms the hypothesis of Ob1, classifier Ob1 against Ob3 rejects it.

Fig. 5. Two examples of recognition. The tree on the left shows an example of a positive recognition test while the tree on the right shows an example of a negative recognition test.
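The candidate test of Section 3.4 amounts to a depth-first walk through the tree of one-class SVMs, followed by the two-class checks for critical pairs. The following sketch is ours; node, mask, svm and the classifier objects are hypothetical interfaces, not code from the paper.

```python
def consistent_path(node, window):
    """True if some root-to-leaf path of one-class SVMs accepts the masked window
    at every level of the tree structure."""
    if node.svm.decision(window[node.mask]) < 0:      # outside the hyper-sphere
        return False
    if not node.children:                             # reached a leaf
        return True
    return any(consistent_path(child, window) for child in node.children)

def recognize(window, tree, critical_two_class):
    """Accept the window as the object only if the one-class hierarchy is consistent
    and every two-class hierarchy of a critical pair confirms the hypothesis."""
    return consistent_path(tree.root, window) and \
        all(clf.confirms(window) for clf in critical_two_class)
```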

4. Experiments

In the experiments we compare the proposed SVM hierarchies with two other methods: the BW method (Roobaert, 2001) and the robust PCA method (Leonardis and Bischof, 2000), which were also designed to deal with cluttered backgrounds. We briefly describe them in Sections 4.1 and 4.2.

4.1. BW method

The BW method relies on two-class SVMs. To overcome the problem of cluttered background, additional training images can be generated; the original background can be replaced with all possible backgrounds. The number of all such images is enormous; therefore Roobaert (2001) proposes a data selection approach, so-called pedagogical learning. He suggests taking only extreme values as background values. In practice this means that the training set consists of objects pasted on a black and a white background. During the experiments we realized that training with a black and white background is not sufficient for color images. We give the following illustration of the problem. The color image can be represented by two orthogonal vectors. The first vector is composed of object pixels, while the second vector is composed of background pixels. The background vector can be decomposed into three RGB color components. Assuming constant backgrounds of all possible colors, the three background color components span a cube, as illustrated in Fig. 7. The corners of the cube correspond to images of the following eight backgrounds: black, red, green, blue, yellow,


Fig. 7. The extension of the BW method to color images. The training with a white and black background is not sufficient if the area belonging to the object is small in comparison to the area of the background. Note that the object vector OV is orthogonal to all three color components.

cyan, magenta, and white. In the case that the number of object pixels is small and the number of background pixels is large, the distances between the corners of the cube are much larger than the length of the object vector OV. A black and white background does not encompass the spread of points caused by a color background, and hence the OSH computed only on a black and white background can eliminate a large portion of the color cube. The training set for the BW method was therefore extended by images representing the corners of the color cube, i.e., by six additional color backgrounds. The BW method also requires the generation of a non-object class. Multi-class SVMs were realized by applying the tennis tournament principle (Pontil and Verri, 1998).
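The color-cube extension can be reproduced with a few lines of NumPy (a sketch under the assumption that a binary object mask is available for each training image; the names are ours).

```python
import numpy as np

# The eight corners of the RGB cube used as constant background colors.
CUBE_CORNERS = [(r, g, b) for r in (0, 255) for g in (0, 255) for b in (0, 255)]

def paste_on_cube_corners(image, object_mask):
    """Replace the background of a (p, r, 3) training image with each constant
    corner color of the RGB cube; object_mask is a (p, r) boolean object region."""
    out = []
    for color in CUBE_CORNERS:
        bg = np.empty_like(image)
        bg[...] = color                       # constant colored background
        bg[object_mask] = image[object_mask]  # keep the object pixels
        out.append(bg)
    return out
```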

4.2. Robust PCA

Principal component analysis (PCA) is a well-known image coding technique used extensively also for object recognition. It uses the eigenvectors of an image set as orthogonal bases for representing the individual images in the set. Though a large number of eigenvectors may be required for a very accurate reconstruction of an object image, only a few eigenvectors are generally sufficient to capture the significant appearance characteristics of an object. These eigenvectors constitute what is referred to as the eigenspace (Nayar et al., 1996). In the eigenspace an image x is approximated by a linear combination of eigenvectors, \tilde{x} = \sum_{i=1}^{p} a_i e_i, where p < N denotes the dimension of the eigenspace and N the number of images in the training set. The parameters a = (a_1, ..., a_p) are computed by projecting the vector x onto the eigenvectors, a_i(x) = (x \cdot e_i) = \sum_{j=1}^{n} x_j e_{i,j}, 1 ≤ i ≤ p, with n representing the number of all image pixels. The limitation of such an approach is a non-robust estimation of the parameters and therefore an inability to cope with cluttered backgrounds. In (Leonardis and Bischof, 2000) the coefficients are computed in a robust way. Starting from k randomly selected pixels r = (r_1, ..., r_k), the solution vector a ∈ R^p is sought which minimizes the following over-constrained system of equations in a least-squares manner:

E(r) = \sum_{i=1}^{k} \Big( x_{r_i} - \sum_{j=1}^{p} a_j(x) e_{j r_i} \Big)^2.    (6)

Then, based on the error distribution of the set of pixel values, the number of pixels is reduced by a factor of a and Eq. (6) is solved again with this reduced set of pixels. The process is repeated until all pixel values are either within the compatibility threshold H, or the number of pixels is smaller than x. Many hypotheses are generated using different subsets of k pixels. The best hypothesis is then selected according to the Minimum Description Length principle. In our experiments the selection of the subset of pixels is additionally controlled by masking the images with the intersection and union masks. The number of pixels k represents 95% of the mask region.
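A single hypothesis of the robust estimation can be sketched as below (an illustration, not the original implementation; the parameter names alpha, theta and omega stand for the reduction factor, the compatibility threshold and the minimal number of pixels, and their defaults are ours). Many such hypotheses, generated from different random pixel subsets, are then compared by the MDL selection.

```python
import numpy as np

def robust_hypothesis(x, E, k, alpha=0.85, theta=10.0, omega=20, rng=None):
    """Estimate the coefficient vector a of Eq. (6) from a random subset of pixels.

    x: image as a vector of length n; E: (n, p) matrix whose columns are eigenvectors;
    k: initial number of randomly selected pixels."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(x), size=k, replace=False)
    while True:
        a, *_ = np.linalg.lstsq(E[idx], x[idx], rcond=None)     # least-squares fit of Eq. (6)
        residuals = np.abs(x[idx] - E[idx] @ a)
        if residuals.max() <= theta or len(idx) < omega:
            return a
        keep = np.argsort(residuals)[: int(alpha * len(idx))]   # keep the most compatible pixels
        idx = idx[keep]
```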

4.3. Comparison of SVM hierarchies with the extended BW method and the robust PCA method

Experiments were performed on two different sets of object images: the TOY database, prepared in our laboratory, and the COIL-100 database. The TOY database includes four objects: a teapot, a ghost,


a mouse, and a pelican (Fig. 8). Even though the database is small, it represents a hard nut to crack due to the similar color of the teapot and the pelican, the small projection region of the pelican from some views, and the bent shape of the ghost. Objects in both databases are represented by 72 images acquired for different rotations of a turntable. For training purposes 36 images (odd views) were selected. In the case of the extended BW method the training set was enlarged by pasting objects on differently colored backgrounds, as described above, leading to 288 training images for each object. An additional non-object set of 477 images was generated from randomly chosen images of scene backgrounds. As already demonstrated in (Pontil and Verri, 1998), the SVM classifier is not robust to a bias in registration. Therefore, in the case of SVM hierarchies the training set was enlarged by shifting the original training images by one pixel into all 8 neighboring positions, leading to 324 images per object. The eigenspace for robust PCA was constructed only from the basic training


set. The dimension of the eigenspace for the TOY database was 8, capturing 74% of the training set variance, while in the case of the set of selected objects from the COIL-100 database the dimension was 15, which represents 84% of the training set variance. To compute the SVMs we use the Sequential Minimal Optimization method (Platt, 1998) from the Statistical Pattern Recognition Toolbox for Matlab (Franc and Hlavac, 2000). We deal with color images of size 32 × 32 × 3 and color values in the interval [0, 255].

4.3.1. Experiment 1

Here we experiment only with the TOY database. Test images were prepared in the following way: all acquired images were extended by a frame of one pixel width. Then the sets of images were enlarged by replacing the black background with 10 different color backgrounds. The total number of test images was 3168. Each test image allows nine different positions of a window of size 32 × 32. Table 1 gives the results.

Fig. 8. Recognition results on black and reddish background. The teapot, the ghost, the mouse, and the pelican were searched in the first, second, third, and fourth row, respectively. Squares show places where objects were localized and recognized.


Table 1
Results for the extended BW method, robust PCA, and SVM hierarchies for the TOY database on 10 different backgrounds

                     extBW     rPCA     SVMh
  TP [%]             94.82     96.95    98.07
  FP [%]              5.18      3.05     0.39
  Unclassified [%]       –         –     1.54
  Time [s]            1.43      0.38     2.92

The last row gives the average execution time needed to process one position of the searching window. Tests were performed with non-optimized Matlab code on a Mobile Pentium III 1000 MHz processor.

In the case of the extended BW method we classify into five different classes: ghost, teapot, mouse, pelican, and background. SVM hierarchies and robust PCA classify into four different classes. While robust PCA has to decide on one of the four known objects, SVM hierarchies need not and allow the test image to be left unclassified. Since in this experiment all of the test images represent the known objects, robust PCA has a small advantage over the extended BW method and SVM hierarchies. In the performed experiment SVM hierarchies achieved the best results. In order to assess the robustness of the methods to noise, we added zero-mean random noise uniformly distributed in the interval [−n, +n] to the color values of each pixel of the test images. The results are depicted in Table 2. All three methods preserve good recognition performance in the presence of small noise values

and degrade gracefully for larger noise values, where the smallest change is observed for the extended BW method and the largest for SVM hierarchies. We can also notice that in the case of SVM hierarchies an increase of the noise level does not cause an increase in the percentage of FPs; instead it increases the percentage of unclassified examples.

4.3.2. Experiment 2

Here we test the ability of all three methods to localize and correctly recognize objects in a larger test image. The experiments are first done for objects from the TOY database and next for a subset of 21 objects from the COIL-100 database.

4.3.2.1. Example on the TOY database. Images of objects from the TOY database were pasted on two different backgrounds: black and reddish (Fig. 8). In both test images the selected object views and their arrangement are the same. A searching window was shifted over the test images. At each image location we performed a recognition test with all three methods. The first, second, and third columns of Fig. 8(a) and (b) show the results obtained by the extBW method, rPCA, and SVMh, respectively. The teapot, the ghost, the mouse, and the pelican were searched for in the first, second, third, and fourth row of images, respectively. Squares show the places in the images where the particular object was recognized. We can notice that the extBW method has the largest FP rate. The results of recognition are

Table 2
Results for the extended BW method, robust PCA, and SVM hierarchies for the TOY database on 10 different backgrounds

  ±n                          extBW     rPCA     SVMh
  ±13  TP [%]                 94.34     96.72    98.07
       FP [%]                  5.66      3.28     0.08
       Unclassified [%]           –         –     1.85
  ±26  TP [%]                 93.85     96.13    97.18
       FP [%]                  6.16      3.87     0.08
       Unclassified [%]           –         –     2.74
  ±52  TP [%]                 91.98     92.57    89.87
       FP [%]                  8.02      7.43     0.08
       Unclassified [%]           –         –    10.15

The robustness of all three methods to noise is tested by corrupting the test images with zero-mean random noise uniformly distributed in the interval [−n, +n].


highly influenced by the background. The method detects all objects in all given views. In the case of robust PCA we can notice that some frames are misplaced over the two views of the same objects. Some other FPs can also be noticed. The results depend on the background and on the randomly selected subsets of pixels. On the reddish background the method was not capable of detecting one of the pelican views. Robust PCA employs a voting function for recognition (Bischof et al., 2001). Since the non-object class is not introduced, the method needs a threshold to determine whether the value of the voting function is sufficient. Fig. 9 shows the recognition results for the pelican obtained by applying different threshold values. The threshold should be selected in accordance with the task, balancing the trade-off between precision and recall. Different objects require different threshold values, which have to be set manually. SVM hierarchies give great results for the first three objects, i.e., the teapot, the ghost, and the mouse. The results are almost the same for both backgrounds. There are no FPs. The method performs worse in the case of the pelican. All four views of the pelican were found and well localized, but there are many FPs. The FPs in the area of the teapots are due to the very similar color of the pelican and the teapot. The projection of the pelican


from the back side view lies completely inside the teapot; even more, there are many such locations. These FPs can be avoided by extending the training set for the two-class SVMs with patterns obtained by shifting the masks over the area of the teapot. The FPs in the areas of the ghosts are due to the lack of a two-class SVM hierarchy between the pelican and the ghost. The small projection areas of the pelican for some viewing directions and the bent shape of the ghost result in a very small (only a few pixels) intersection mask. A classifier on such a small region would be unreliable; therefore the critical pair between the two objects was not established. This problem could easily be solved by moving the critical pair test to the second level of the tree structures and then considering only the higher levels of the two-class SVM hierarchy.

4.3.2.2. Examples on the COIL-100 database. All three methods were also tested on the COIL-100 database. Fig. 10 shows the 21 selected objects from COIL-100. Two different test images were prepared to evaluate the methods. Results are presented in Figs. 11 and 12. In both test images all 21 objects were searched for. Results are given in the following way: Above the test image a small image of the searched object is given. Next to the small image

Fig. 9. Pelican localization with rPCA. The voting function needs a threshold value. The applied threshold values (from left to right) were: 0, 0.1, 0.3, 0.5, and 1.

Fig. 10. Twenty-one selected objects from the COIL-100 database.


Fig. 11. Tests on objects from the COIL-100 database, chessboard-like background: (a) BW method, (b) rPCA, (c) SVMh.

Fig. 12. Tests on objects from the COIL-100 database, brownish background: (a) BW method, (b) rPCA, (c) SVMh.


three different marks are possible: a check mark denotes that the object was correctly located and recognized; an exclamation mark denotes the presence of false positives; a question mark indicates that the object was not recognized. The results for objects not present in the test images are given only in the cases of detected FPs. The tests performed on the COIL-100 database support the findings obtained on the TOY database. The extBW method has the highest FP rate and has problems finding all objects. rPCA also failed to recognize all objects, and some FPs are detected. SVM hierarchies perform best: the method was capable of recognizing all objects and gives the smallest number of FPs.

5. Conclusions

In this paper we proposed a hierarchy of classifiers for recognition of 3-D objects on cluttered backgrounds. The applied method combines one- and two-class SVMs trained on different image subsets and different regions in images. The regions are represented by masks organized into tree structures. One-class SVMs limit the space of pattern appearances to the interior of the hyper-spheres, while two-class SVMs resolve the situation inside the hyper-sphere projections in the proper dimensions. The proposed combination of one- and two-class SVM hierarchies allows a pattern to be classified as unknown, which has a valuable effect on the FP rate. The problem of small intersection masks for some objects could be solved by starting the process of recognition at higher levels of the tree structures or by dividing the training set into more than two subsets at some levels of the tree structure. Besides the small number of false positives and the robustness to cluttered backgrounds, the method has other important properties: When searching for a particular object in images the method is fast, since at the first level of the one-class SVM hierarchy it is capable of quickly rejecting many patterns at places in the image belonging to the background and other objects. With the higher levels of the SVM hierarchy the method is able not just to recognize the object but also to recognize


the rough object orientation (front, back, left side, etc.). The main objective of the proposed method was to cope with cluttered backgrounds. The performed tests demonstrate that the method achieves good recognition rates even in the presence of a moderate amount of noise. In the future we plan to extend the method to cope with different illumination conditions and with objects at different scales. Drastic changes in illumination may eventually require the implementation of SVMs with non-linear kernels; however, the structure of the scheme remains the same. SVM hierarchies have not been designed to work at different scales. The linear scale-space approach can be applied to the data in the phase of learning and recognition to potentially extend the usage of the method to multiple scales.

Acknowledgement This work was supported in part by the EU project CogVis (IST-2000-2937), the grants funded by the Ministry of Education, Science and Sport: Research Program Computer Vision-1539-506 and SLO-A/07, and by the Federal Ministry of Education, Science and Culture of Austria under the CONEX program. We would like to thank the reviewer for the suggestions which helped us to improve the quality of the paper.

References

Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J., 1997. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 711–720.
Bischof, H., Wildenauer, H., Leonardis, A., 2001. Illumination insensitive eigenspaces. In: Proc. ICCV01. IEEE Computer Society, pp. 233–238.
Chen, Y., Zhou, X., Huang, T., 2001. One-class SVM for learning in image retrieval. In: Proc. 2001 IEEE Internat. Conf. on Image Processing (ICIP-01), October 7–10, 2001. IEEE, Thessaloniki, Greece, pp. 34–37.
Cristianini, N., Taylor, J.S., 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.


Fleming, M., Cottrell, G., 1990. Categorization of faces using unsupervised feature extraction. In: Proc. IJCNN-90, vol. 2, pp. 65–70.
Franc, V., Hlavac, V., 2000. Statistical Pattern Recognition Toolbox for Matlab. Available from: .
Heisele, B., Ho, P., Poggio, T., 2001. Face recognition with support vector machines: Global versus component-based approach. In: Proc. 8th Internat. Conf. on Computer Vision, vol. 2, pp. 688–694.
Leonardis, A., Bischof, H., 2000. Robust recognition using eigenimages. Comput. Vision Image Understand.: CVIU 78 (1), 99–118.
Murase, H., Nayar, S.K., 1995. Image spotting of 3D objects using the parametric eigenspace representation. In: Proc. 9th Scandinavian Conf. on Image Analysis, vol. 1, pp. 323–332.
Nayar, S., Murase, H., Nene, S., 1996. Parametric appearance representation. In: Early Visual Learning. Oxford University Press, pp. 131–160.

Platt, J.C., 1998. Sequential minimal optimization: A fast algorithm for training support vector machines. Tech. Rep. MSR-TR-98-14, Microsoft Research.
Pontil, M., Verri, A., 1998. Support vector machines for 3D object recognition. IEEE Trans. Pattern Anal. Machine Intell. 20 (6), 637–646.
Rizvi, S.A., Nasrabadi, N.M., 2002. A modular clutter rejection technique for FLIR imagery using region-based principal component analysis. Pattern Recognition 35 (12), 2895–2904.
Roobaert, D., 2001. Pedagogical support vector learning: A pure learning approach to object recognition. Ph.D. thesis, Royal Institute of Technology (KTH), Department of Numerical Analysis and Computing Science (NADA), Stockholm, Sweden.
Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C., 2001. Estimating the support of a high-dimensional distribution. Neural Comput. 13 (7), 1443–1471.
Vapnik, V.N., 1998. Statistical Learning Theory. Wiley, New York.