
Fast Multiple Instance Learning Via L1,2 Logistic Regression

Zhouyu Fu¹ and Antonio Robles-Kelly¹,²
¹ RSISE, Bldg. 115, Australian National University, Canberra ACT 0200, Australia
² National ICT Australia (NICTA)*, Locked Bag 8001, Canberra ACT 2601, Australia

* NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

Abstract

In this paper, we develop an efficient logistic regression model for multiple instance learning that combines L1 and L2 regularisation techniques. An L1-regularised logistic regression model is first learned to identify the sparsity pattern of the features. To train the L1 model efficiently, we employ a convex, differentiable approximation of the L1 cost function which can be solved by a quasi-Newton method. We then train an L2-regularised logistic regression model only on the subset of features with nonzero weights returned by the L1 logistic regression. Experimental results demonstrate the utility and efficiency of the proposed approach compared to a number of alternatives.

1. Introduction

Multiple-instance learning (MIL) [6] is a machine learning paradigm in which, unlike conventional supervised learning scenarios, learning and classification are addressed at the bag level. In the binary classification MIL setting, a bag is a collection of instances, and the aim is to assign bags to positive or negative classes. MIL has great potential in many computer vision and pattern recognition applications. For instance, in content-based image retrieval (CBIR), each image contains many regions, but only a subset of them is of interest. Here an image is a bag, the image regions are its instances, and the CBIR problem can be cast in a MIL setting. In a typical MIL scenario, a negative bag consists only of negative instances, whereas a positive bag comprises both positive and negative ones. Due to the existence of outliers in positive bags, applying conventional supervised classification algorithms directly to MIL problems often leads to much degraded performance. As a result, special-purpose methods have been designed to handle the MIL scenario, including axis-parallel hyper-rectangles [6], Diverse Density (DD) [9], EM-DD [12], and SVMs [3, 5].

The MILES algorithm proposed by Chen et al. [5] is of particular interest here. It converts a multiple-instance learning task into a conventional classification problem by constructing bag-level features from bag-to-instance similarities and training an L1-norm SVM on the resulting feature embedding. The L1-norm SVM is preferred over the conventional L2-norm SVM because of the high dimensionality of the feature space, which is given by the total number of instances in the training set. Consequently, there are a large number of redundant features that may affect the accuracy of the classifier. The L1 norm serves both feature selection and classification purposes. It leads to a sparse solution of the feature weights, i.e., a solution in which many features have zero weight. These nil features correspond either to redundant features or to features that contribute no information to the classification task. Despite the effectiveness of this treatment, training an L1-norm SVM is time consuming, even for moderately large data sets, as it involves solving a linear programming (LP) problem which can be computationally demanding.

2. Motivation and Contributions

In contrast to LP, an unconstrained optimisation problem with a convex, differentiable cost function is much cheaper to solve. One such example is the logistic regression model [4], which can be solved through iterative optimisation procedures such as gradient descent, Newton and quasi-Newton methods [11]. The main difficulty with L1 norm-based regularisers lies in their non-differentiability. If the sparsity pattern of the feature weights can be identified, the problem becomes much easier. In this paper, we propose a novel logistic regression model, which we call L1,2 logistic regression, that combines L1 and L2 norm-based regularisation techniques. First, we apply L1 logistic regression to the data to recover the sparsity pattern. To do this, we only need to solve the L1 logistic regression problem approximately. This is achieved by approximating the non-differentiable L1 function with a differentiable, convex log-hyperbolic function parametrised by a bandwidth parameter. This function converges pointwise to the L1 function as the bandwidth parameter approaches infinity. We then apply a limited-memory quasi-Newton method (L-BFGS) [8] to the new cost function, with the L1 term replaced by the log-hyperbolic function, so as to obtain the solution of the L1 logistic regression problem. We then obtain the final estimate of the feature weights by training an L2 logistic regression model on the subset of features with non-zero weights as estimated by the L1 logistic regression model. Both steps are very efficient due to the use of differentiable cost functions and the L-BFGS method for unconstrained optimisation.
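As a quick illustration of the approximation idea outlined above (the surrogate function is formalised as Equation (6) in Section 3.2), the minimal Python/numpy check below, which is not part of the paper's original C++ implementation, verifies that the log-hyperbolic surrogate lower-bounds |x| and that the gap shrinks as the bandwidth σ grows:

```python
# Illustrative sketch only: numerically compare (1/sigma)*log(cosh(sigma*x))
# with |x| for a few bandwidth values.
import numpy as np

def logcosh_l1(x, sigma):
    t = sigma * np.asarray(x, dtype=float)
    # log(cosh(t)) computed stably as log(e^t + e^-t) - log(2)
    return (np.logaddexp(t, -t) - np.log(2.0)) / sigma

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for sigma in (1.0, 10.0, 100.0):
    gap = np.abs(x) - logcosh_l1(x, sigma)  # non-negative: the surrogate is a lower bound
    print(f"sigma = {sigma:5.1f}   max gap to |x| = {gap.max():.4f}")
# The maximum gap behaves like log(2)/sigma, so sigma = 10 (the value used in the
# paper) already gives a tight, everywhere-differentiable surrogate.
```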

3. Logistic Regression for MIL

3.1. Preliminaries and Feature Representation

At this point, it is worth noting that we focus on binary MIL problems, as any multiclass problem can be converted into several binary problems using the one-against-others strategy. To describe the MIL problem and our algorithm, we need to introduce some notation. Denote B^{tr} = {B_1, ..., B_m} the set of training bags and Y = {Y_1, ..., Y_m} their corresponding labels, with Y_i \in \{0, 1\} for each i. Each bag B_i contains n_i instances denoted x_{i,j} for j = 1, ..., n_i. With a slight abuse of notation, and depending on the context, x_{i,j} also denotes the feature vector of the instance. Different bags can have different numbers of instances, hence n_i may vary with i. Each instance x_{i,j} also has a label which is not directly observable. Two assumptions are made about instance-level labels: all instances in each negative bag are negative, whereas at least one instance in each positive bag is positive. The purpose is, therefore, to predict the label value for a novel testing bag B = {x_1, ..., x_l}.

With the above ingredients, we now describe the feature embedding strategy, which was first introduced in [5]. For the sake of clarity, we re-index all instances in the training set as x_j for j = 1, ..., n, where n = \sum_i n_i is the total number of training instances. An n-dimensional feature vector is constructed for each bag, where the kth feature component is given by the bag-to-instance similarity

s(B_i, x_k) = \exp\left( -\frac{d(B_i, x_k)^2}{2\sigma^2} \right), \qquad d(B_i, x_k) = \min_j \| x_{i,j} - x_k \|_2    (1)

Note that d(B_i, x_k) in the above equation is a special case of the Hausdorff distance defined over two sets, where the second set is a singleton. The assumption is that the instance in each bag closest to an instance prototype carries the maximum amount of category information. This is usually valid in MIL scenarios. The resulting feature vector for bag i becomes z_i = [s(B_i, x_1), ..., s(B_i, x_n)]^T. This formulation allows for flexibility and robustness in the feature mapping, even if the closeness assumption breaks down for certain prototypes. The feature mapping is quite tolerant to possible inaccuracies in noisy data: as long as the assumption above holds for the majority of the prototypes, the feature mapping remains informative.
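As an illustration of this embedding, the sketch below (Python/numpy, not the authors' implementation; the function name `embed_bags` and the array layout are our own) maps every bag to its vector of similarities to all training instances following Equation (1):

```python
import numpy as np

def embed_bags(bags, prototypes, sigma):
    """bags: list of (n_i, d) arrays of instance features; prototypes: (n, d)
    array holding all n training instances; returns the (num_bags, n) matrix Z
    whose i-th row is z_i = [s(B_i, x_1), ..., s(B_i, x_n)]."""
    Z = np.zeros((len(bags), len(prototypes)))
    for i, B in enumerate(bags):
        # distances between every instance of bag i and every prototype
        dists = np.linalg.norm(B[:, None, :] - prototypes[None, :, :], axis=2)
        d = dists.min(axis=0)                        # closest instance per prototype
        Z[i] = np.exp(-d ** 2 / (2.0 * sigma ** 2))  # Gaussian similarity, Eq. (1)
    return Z
```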

3.2. L1,2 Logistic Regression Model

We first consider logistic regression models for binary classification, which make use of the following logit transform for modelling the posterior distribution of label values,

P(f_i = 1 \mid x_i) = \frac{1}{1 + e^{-w^T x_i - b}}, \qquad P(f_i = 0 \mid x_i) = 1 - P(f_i = 1 \mid x_i)    (2)

where f_i is the predicted label for the data point x_i, and w and b denote the linear weights and bias, respectively. Here, w and b are the parameters to be estimated. To simplify the notation, in the discussion below we omit the bias term b, which can be processed in the same way as the weights w without loss of generality. Given a training sample (x_i, y_i), i = 1, ..., m, of data points and labels, the logistic regression model aims at recovering the parameters that minimise the following negative log-likelihood cost function

l(w; X, Y) = -\sum_{i=1}^{m} \log P(f_i = y_i \mid x_i)    (3)

Note that the above cost function is continuous, differentiable and convex. Its global minimum can be found using a number of iterative optimisation schemes. In practice, a prior probability is placed on the linear weights to control the complexity of the logistic model. This leads to a new cost function with an additional regularisation term,

f(w) = l(w; X, Y) + \lambda r(w)    (4)

Two common choices for the regularisation term, as plotted in Figure 1(a), are the L1 and L2 cost functions

r_1(w) = |w|, \qquad r_2(w) = w^2

which correspond to the Laplacian and Gaussian priors on the parameters w, respectively. Both L1 and L2 regularisers have their strengths and weaknesses. The L2 logistic regression model is a natural choice, as its cost function is differentiable and can be optimised efficiently. For training a large-scale L2 logistic regression model, we use the L-BFGS method [8], which only requires the gradient of the L2 cost function, given by

\nabla_w f_2(w) = \sum_{i=1}^{m} (u(x_i) - y_i)\, x_i + 2\lambda w, \qquad u(x_i) = P(f_i = 1 \mid x_i) = \frac{1}{1 + e^{-w^T x_i - b}}    (5)
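For concreteness, the following sketch (Python with numpy/scipy rather than the C++ L-BFGS implementation used in the paper; the helper name `l2_logreg_fit` is ours) minimises the L2-regularised cost of Equations (3)–(5) with L-BFGS:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def l2_logreg_fit(Z, y, lam=0.1, max_iter=1000):
    """Z: (m, n) feature matrix, y: 0/1 labels; returns weights w and bias b."""
    m, n = Z.shape
    X = np.hstack([Z, np.ones((m, 1))])        # absorb the bias as a last column

    def cost_grad(wb):
        z = X @ wb                              # w^T x + b
        u = expit(z)                            # P(f = 1 | x), Eq. (2)
        nll = np.sum(np.logaddexp(0.0, z) - y * z)   # negative log-likelihood, Eq. (3)
        cost = nll + lam * np.sum(wb[:-1] ** 2)      # plus the L2 penalty, Eq. (4)
        grad = X.T @ (u - y)                    # first term of Eq. (5)
        grad[:-1] += 2.0 * lam * wb[:-1]        # + 2*lambda*w (no penalty on the bias)
        return cost, grad

    res = minimize(cost_grad, np.zeros(n + 1), jac=True,
                   method='L-BFGS-B', options={'maxiter': max_iter})
    return res.x[:-1], res.x[-1]
```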

The L1 logistic regression model produces a sparse solution and is well suited for feature selection in high-dimensional data spaces [10]. Nevertheless, solving the L1-regularised model is not a straightforward task, since the regulariser is non-differentiable at the origin. A few fast methods have been proposed to scale up the training of L1 logistic regression models [2, 7]. The common idea is to identify the signs of the weights first and drop the absolute values; subsequent optimisation is applied only to the subset of non-zero weights, keeping their signs fixed. Here we propose a simpler approach that finds the sparsity pattern by approximating the L1 regulariser with a convex, differentiable log-hyperbolic function given by

r'(x) = \frac{1}{\sigma} \log(\cosh(\sigma x))    (6)

where σ is a bandwidth parameter controlling the approximation error. As can be seen in Figure 1(b), a larger σ yields a tighter lower bound with a smaller approximation error. As σ tends to infinity, the approximation error approaches 0 and the function r'(x) converges pointwise to r_1(x) = |x|. This means the approximation error is determined solely by the parameter σ and does not depend on x. In our experiments, we set σ = 10, which achieves an approximation close enough to r_1(x) = |x| while maintaining numerical stability. Moreover, r'(x) is differentiable everywhere, even at the origin, as can be seen from the zoomed-in view in Figure 1(c). This allows us to use the same optimisation technique as for the L2 regulariser to minimise the L1 cost function. The gradient of the L1 regulariser is similar to that of the L2 case in Equation (5), with the second term on the right-hand side replaced by

\nabla_w r_1(w) = \tanh(\sigma w)

After minimising the L1 cost, many elements of w vanish. Due to the nature of the numerical optimisation, these weights will not be exactly zero, but very small numbers which can be identified with a threshold. In our experiments, we set this threshold to the inverse of σ. L2 logistic regression is then applied to the subset of features whose weights, as returned by the L1 step, are non-zero, and the resulting solution is substituted back into the original feature weights. For multiclass logistic regression, we simply learn a regression model for each class against all others. The labels for the test data are assigned by majority voting over the decision values delivered by the one-against-others regression models.
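The complete two-step procedure can then be sketched as follows (again an illustrative Python version under the same assumptions, reusing the `l2_logreg_fit` sketch above; it is not the paper's C++ code): the L1 term is replaced by the surrogate of Equation (6) and minimised with L-BFGS, weights below 1/σ in magnitude are discarded, and an L2 model is refitted on the surviving features.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def l12_logreg_fit(Z, y, lam=0.1, sigma=10.0, max_iter=1000):
    """Two-step L1,2 logistic regression: L1 step with the log-cosh surrogate,
    thresholding at 1/sigma, then an L2 refit on the selected features."""
    m, n = Z.shape
    X = np.hstack([Z, np.ones((m, 1))])

    def cost_grad(wb):
        z = X @ wb
        u = expit(z)
        nll = np.sum(np.logaddexp(0.0, z) - y * z)
        t = sigma * wb[:-1]
        surrogate = np.sum(np.logaddexp(t, -t) - np.log(2.0)) / sigma  # Eq. (6)
        grad = X.T @ (u - y)
        grad[:-1] += lam * np.tanh(t)           # gradient of the L1 surrogate
        return nll + lam * surrogate, grad

    res = minimize(cost_grad, np.zeros(n + 1), jac=True,
                   method='L-BFGS-B', options={'maxiter': max_iter})
    w1 = res.x[:-1]
    selected = np.flatnonzero(np.abs(w1) > 1.0 / sigma)   # sparsity pattern
    w2, b2 = l2_logreg_fit(Z[:, selected], y, lam, max_iter)
    w = np.zeros(n)
    w[selected] = w2                             # substitute back into the full weight vector
    return w, b2
```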

Figure 1. Comparison of regularisation terms: (a) the L1 and L2 cost functions; (b) the log-hyperbolic approximation for different values of σ; (c) zoomed-in view near the origin.

Algorithms     1000-Image Set       2000-Image Set
L1,2 logreg    82.0 [80.7, 83.2]    69.9 [69.1, 70.7]
L2 logreg      82.7 [81.3, 84.1]    66.9 [65.9, 67.9]
MILES          82.3 [81.4, 83.2]    68.7 [67.3, 70.1]

Table 1. Comparison of classification accuracy (%) for the proposed algorithm against the alternatives.

4. Experimental Results

In this section, we demonstrate the utility of the proposed algorithm for MIL on region-based image categorisation. To this end, we have used the COREL data set, which contains 2000 images taken from 20 different categories, with 100 images in each category. Each image is segmented into several regions and features are extracted from each region. This is a typical MIL problem with images as bags and region features as instances. Details of segmentation and feature extraction are beyond the scope of this paper; interested readers are referred to [5] for further details on the database. Here, we compare our algorithm against two alternatives: L2 logistic regression and the MILES algorithm [5]. Both logistic regression models are implemented in C++ and, for the linear programming component of MILES, we have used the MOSEK optimisation software [1]. For the L2 logistic regression, we fix the number of iterations to 1000 and the regularisation parameter λ to 0.1. For MILES, we adopt the optimal parameter settings reported in [5].

Algorithms     1000-Image Set    2000-Image Set
L1,2 logreg    22.9 ± 3.5        188.4 ± 10.7
L2 logreg      9.7 ± 1.9         128.8 ± 3.8
MILES          263.3 ± 6.2       3619 ± 210.5

Table 2. Comparison of running speed in seconds for the proposed algorithm against the alternatives.

Figure 2. Distributions of feature weights. From left to right: MILES, L1,2 and L2.

We conduct two experiments on the image data set. The first uses the first 10 categories in the data set for training and testing. The second experiment uses the complete data set with all 20 categories. For both experiments, we randomly split the images 50%/50% into training and testing data, respectively. Training and testing are repeated over 5 different random splits. The one-against-others strategy is adopted in order to handle the multiclass classification tasks. The classification accuracy rates are reported in Table 1. From the table, we can conclude that both logistic regression models are very competitive with MILES for image categorisation in terms of classification accuracy. This is especially encouraging considering that MILES consistently outperforms other MIL approaches on the COREL image data set. Note that the performance of the L2 logistic regression model degrades compared to the other two methods on the larger data set of 2000 images, while it performs quite well on the 1000-image data set. This validates the necessity of imposing sparsity constraints when learning with high-dimensional features [10]. The L1,2 logistic regression model, on the other hand, does not suffer from this problem and, as a result, outperforms the alternatives.

We now turn our attention to the speed of the methods under study. Table 2 lists the time in seconds spent on training all the classes for the different methods. We can see that the logistic regression models are much more efficient to train than MILES. This is due to the complexity of the LP optimisation required by MILES. The L2 logistic regression is faster than the L1,2 model, but it does not produce a sparse solution. This affects efficiency at testing time, since many more features need to be computed and evaluated. This can be clearly seen from the plots of the feature weight distributions for MILES, L1,2 and L2 logistic regression in Figure 2. While most feature weights for MILES and the L1,2 model are 0, with a few features with large weights that carry the classification information, the majority of the feature weights for L2 regularisation are small but non-zero. This makes L2 less suitable for feature selection.
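For reference, a sketch of this evaluation protocol could look as follows (a hypothetical helper built on the `embed_bags` and `l12_logreg_fit` sketches above; the paper's actual experiments use its C++ implementation, the embedding is simplified here to a precomputed matrix Z, and the test label is taken from the largest one-against-others decision value):

```python
import numpy as np

def run_split(Z, labels, num_classes, rng, lam=0.1):
    """One random 50/50 split: train a one-against-others L1,2 model per class
    and return the bag-level classification accuracy on the held-out half."""
    idx = rng.permutation(len(labels))
    tr, te = idx[: len(idx) // 2], idx[len(idx) // 2:]
    scores = np.zeros((len(te), num_classes))
    for c in range(num_classes):
        y = (labels[tr] == c).astype(float)     # class c vs. all others
        w, b = l12_logreg_fit(Z[tr], y, lam)
        scores[:, c] = Z[te] @ w + b            # decision values for class c
    pred = scores.argmax(axis=1)                # strongest response wins
    return np.mean(pred == labels[te])

# e.g. average over 5 random splits of the 2000-image, 20-category set:
# acc = np.mean([run_split(Z, labels, 20, np.random.default_rng(s)) for s in range(5)])
```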

5. Conclusions

We have proposed an efficient logistic regression model that combines the L1 and L2 regularisers and applied it to feature selection and classification for multiple-instance learning. A fast two-step optimisation scheme based on a differentiable cost function is proposed so as to perform the minimisation task effectively. Experimental results on region-based image categorisation demonstrate the effectiveness and utility of the proposed method. This suggests the utility of the method for object categorisation tasks in very high-dimensional feature spaces.

References

[1] MOSEK. http://www.mosek.com, 2001.
[2] G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In ICML, 2007.
[3] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2003.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[5] Y. Chen, J. Bi, and J. Wang. MILES: Multiple-instance learning via embedded instance selection. IEEE Trans. on PAMI, 28(12):1931–1947, 2006.
[6] T. Dietterich, R. Lathrop, and T. Lozano-Perez. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1–2):31–71, 1997.
[7] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale L1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007.
[8] D. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528, 1989.
[9] O. Maron and T. Lozano-Perez. A framework for multiple-instance learning. In NIPS, 1998.
[10] A. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.
[11] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. Cambridge University Press, 1992.
[12] Q. Zhang and S. Goldman. EM-DD: An improved multiple-instance learning technique. In NIPS, 2002.