On input selection with reversible jump Markov chain Monte Carlo sampling

Peter Sykacek, Austrian Research Institute for Artificial Intelligence (OFAI), Schottengasse 3, A-1010 Vienna, Austria, peter@ai.univie.ac.at

Abstract

In this paper we will treat input selection for a radial basis function (RBF) like classifier within a Bayesian framework. We approximate the a-posteriori distribution over both model coefficients and input subsets by samples drawn with Gibbs updates and reversible jump moves. Using some public datasets, we compare the classification accuracy of the method with a conventional ARD scheme. These datasets are also used to infer the a-posteriori probabilities of different input subsets.

1 Introduction

Methods that aim to determine the relevance of inputs have always interested researchers in various communities. Classical feature subset selection techniques, as reviewed in [1], use search algorithms and evaluation criteria to determine one optimal subset. Although these approaches can improve classification accuracy, they do not explore different, equally probable subsets. Automatic relevance determination (ARD) is another approach that determines the relevance of inputs. ARD is due to [6], who uses Bayesian techniques, where hierarchical priors penalize irrelevant inputs. Our approach is also "Bayesian": relevance of inputs is measured by a probability distribution over all possible feature subsets. This probability measure is determined by the Bayesian evidence of the corresponding models. The general idea was already used in [7] for variable selection in linear regression models. Our interest is different, though, as we select inputs for a nonlinear classification model. We want an approximation of the true distribution over all different subsets. As the number of subsets grows exponentially with the total number of inputs, we cannot calculate the Bayesian model evidence directly. We need a method that samples efficiently across parameter spaces of different dimension. The most general method that can do this is the reversible jump Markov chain Monte Carlo sampler (reversible jump MC) recently proposed in [4]. The approach was successfully applied by [8] to determine a probability distribution in a mixture density model with a variable number of kernels, and in [5] to sample from the posterior of RBF regression networks with a variable number of kernels.

A Markov chain that switches between different input subsets is useful for two tasks: counting how often a particular subset was visited gives us a relevance measure of the corresponding inputs; for classification, we approximate the integral over input sets and coefficients by summation over samples from the Markov chain. The next sections will show how to implement such a reversible jump MC and apply the proposed algorithm to classification and input evaluation using some public datasets. Though the approach could not improve on the MLP-ARD scheme from [6] in terms of classification accuracy, we still think that it is interesting: we can assess the importance of different feature subsets, which is different from the importance of single features as estimated by ARD.
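Written out explicitly (in our notation, not the paper's), with $\theta$ denoting all model coefficients, $I$ the input subset, $\mathcal{D}$ the training data and $(\theta^{(s)}, I^{(s)})$, $s = 1, \ldots, S$, the retained posterior samples, this approximation reads

$$P(k \mid \underline{x}, \mathcal{D}) = \sum_{I} \int P(k \mid \underline{x}, \theta, I)\, p(\theta, I \mid \mathcal{D})\, d\theta \;\approx\; \frac{1}{S} \sum_{s=1}^{S} P(k \mid \underline{x}, \theta^{(s)}, I^{(s)}).$$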

2 Methods

The classifier used in this paper is an RBF-like model. Inference is performed within a Bayesian framework. When conditioning on one set of inputs, the posterior over model parameters is already multimodal. Therefore we resort to Markov chain Monte Carlo (MCMC) sampling techniques to approximate the desired posterior over both model coefficients and feature subsets. In the next subsections we will propose an appropriate architecture for the classifier and a hybrid sampler for model inference. This hybrid sampler consists of two parts: we use Gibbs updates ([2]) to sample when conditioning on a particular set of inputs, and reversible jump moves that carry out dimension-switching updates.

2.1 The classifier

In order to allow input relevance determination by Bayesian model selection, the classifier needs at least one coefficient that is associated with each input: roughly speaking, the probability of each model is proportional to the likelihood of the most probable coefficients, weighted by their posterior width divided by their prior width. The first factor always increases when using more coefficients (or input features). The second will decrease the more inputs we use, and together this gives a peak for the most probable model. A classifier that satisfies these constraints is the so-called classification in the sampling paradigm. We model class conditional densities and, together with class priors, express posterior probabilities for classes. In the neural network literature this approach was first proposed in [10]. We use a model that allows for overlapping class conditional densities:

$$p(\underline{x} \mid k) = \sum_{d=1}^{D} w_{kd}\, p(\underline{x} \mid \Phi_d), \qquad p(\underline{x}) = \sum_{k=1}^{K} P_k\, p(\underline{x} \mid k) \tag{1}$$

Using $P_k$ for the $K$ class priors and $p(\underline{x} \mid k)$ for the class conditional densities, (1) expresses posterior probabilities for classes as $P(k \mid \underline{x}) = P_k\, p(\underline{x} \mid k)/p(\underline{x})$. We choose the component densities, $p(\underline{x} \mid \Phi_d)$, to be Gaussian with restricted parametrisation: each kernel is a multivariate normal distribution with a mean and a diagonal covariance matrix. For all Gaussian kernels together, we get $2DI$ parameters, with $I$ denoting the current input dimension and $D$ denoting the number of kernels. Apart from the kernel coefficients, $\Phi_d$, (1) has $D$ coefficients per class, $w_{kd}$, indicating the prior kernel allocation probabilities, and $K$ class priors. Model (1) allows us to treat labels of patterns as missing data and use labeled as well as unlabeled data for model inference. In this case training is carried out using the likelihood of observing inputs and targets:

$$p(T, X \mid \Theta) = \prod_{k=1}^{K} \prod_{n_k=1}^{N_k} P_k\, p(\underline{x}_{n_k} \mid \Theta_k)\; \prod_{m=1}^{M} p(\underline{x}_m \mid \Theta), \tag{2}$$

where $T$ denotes labeled and $X$ unlabeled training data. In (2), $\Theta_k$ are all coefficients the $k$-th class conditional density depends on. We further use $\Theta$ for all model coefficients together, $N_k$ as the number of samples belonging to class $k$, and $m$ as index over the $M$ unlabeled samples. To make Gibbs updates possible, we further introduce two latent allocation variables. The first one, $d$, indicates the kernel number each sample was generated from; the second one is the unobserved class label $c$, introduced for unlabeled data. Typical approaches for training models like (1), e.g. [3] and [9], use the EM algorithm, which is closely related to the Gibbs sampler introduced in the next subsection.
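As a concrete illustration of how the classifier in (1) is evaluated, the following sketch (our own code, not part of the paper; the array layout and function names are assumptions) computes the class posteriors $P(k \mid \underline{x})$ for diagonal Gaussian kernels shared across classes.

```python
# Illustrative sketch (not from the paper): evaluates Eq. (1) with assumed array shapes.
import numpy as np
from scipy.stats import multivariate_normal

def class_posteriors(x, means, variances, W, P):
    """Class posteriors P(k|x) for the RBF-like classifier of Eq. (1).

    means, variances: (D, I) arrays of diagonal Gaussian kernel parameters.
    W: (K, D) kernel allocation probabilities w_kd (rows sum to one).
    P: (K,) class priors P_k.
    """
    D = means.shape[0]
    # p(x | Phi_d) for every kernel d (diagonal covariance)
    kernel_dens = np.array([
        multivariate_normal.pdf(x, mean=means[d], cov=np.diag(variances[d]))
        for d in range(D)
    ])
    class_cond = W @ kernel_dens          # p(x | k) = sum_d w_kd p(x | Phi_d)
    joint = P * class_cond                # P_k p(x | k)
    return joint / joint.sum()            # P(k | x) = P_k p(x | k) / p(x)

# Example with K = 2 classes sharing D = 3 kernels in I = 2 inputs.
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 2))
variances = np.ones((3, 2))
W = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
P = np.array([0.5, 0.5])
print(class_posteriors(np.array([0.3, -0.1]), means, variances, W, P))
```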

2.2 Fixed dimension sampling

In this subsection we will formulate Gibbs updates for sampling from the posterior when conditioning on a fixed set of inputs. In order to allow sampling from the full conditional, we have to choose priors over coefficients from their conjugate family:

• Each component mean, $\underline{\mu}_d$, is given a Gaussian prior: $\underline{\mu}_d \sim \mathrm{N}(\underline{\xi}, \kappa^{-1})$.

• The inverse variance of input $i$ and kernel $d$ gets a Gamma prior: $\sigma_{d,i}^{-2} \sim \Gamma(\alpha, \beta_i)$.

• All $D$ variances of input $i$ have a common hyperparameter, $\beta_i$, which itself has a Gamma hyperprior: $\beta_i \sim \Gamma(g, h_i)$.

• The mixing coefficients, $\underline{w}_k$, get a Dirichlet prior: $\underline{w}_k \sim \mathcal{D}(\delta_w, \ldots, \delta_w)$.

• Class priors, $\underline{P}$, also get a Dirichlet prior: $\underline{P} \sim \mathcal{D}(\delta_P, \ldots, \delta_P)$.

The quantitative settings are similar to those used in [8]: values for $\alpha$ are between 1 and 2, $g$ is usually between 0.2 and 1, and $h_i$ is typically between $1/R_i^2$ and $10/R_i^2$, with $R_i$ denoting the range of the $i$-th input. The mean gets a Gaussian prior centered at the midpoint of the data, $\underline{\xi}$, with diagonal inverse covariance matrix $\kappa$, where $\kappa_{ii} = 1/R_i^2$. The prior counts $\delta_w$ and $\delta_P$ are set to 1 to give the corresponding probabilities non-informative proper Dirichlet priors.
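These data-dependent settings can be collected in a small helper. The sketch below is our own illustration, not the author's code; the concrete choices alpha = 1.5, g = 0.5 and h_i = 5/R_i^2 are assumptions picked from within the ranges stated above.

```python
# Illustrative sketch (not from the paper): data-dependent prior settings.
import numpy as np

def prior_hyperparameters(X, alpha=1.5, g=0.5):
    """Prior settings as described above; alpha, g and the factor 5 in h
    are assumed values within the stated ranges.

    X: (N, I) training inputs restricted to the current feature subset.
    """
    R = X.max(axis=0) - X.min(axis=0)            # per-input ranges R_i
    xi = 0.5 * (X.max(axis=0) + X.min(axis=0))   # midpoint of the data
    kappa = np.diag(1.0 / R**2)                  # prior inverse covariance, kappa_ii = 1/R_i^2
    h = 5.0 / R**2                               # hyperprior scale, between 1/R_i^2 and 10/R_i^2
    delta_w = delta_P = 1.0                      # non-informative proper Dirichlet counts
    return dict(alpha=alpha, g=g, h=h, xi=xi, kappa=kappa,
                delta_w=delta_w, delta_P=delta_P)

X = np.random.default_rng(1).uniform(-1.0, 3.0, size=(50, 4))
print(prior_hyperparameters(X)["h"])
```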

The Gibbs sampler uses updates from the full conditional distributions in (3). For notational convenience we use $\Phi$ for the parameters that determine the class conditional densities. We use $m$ as index over unlabeled data and $c_m$ as latent class label. The index over all data is $n$, $d_n$ are the latent kernel allocations and $n_d$ is the number of samples allocated to the $d$-th component. One distribution does not occur in the prior specification: $\mathrm{Mn}(1, \ldots)$, which is a multinomial-one distribution. Finally we need some counters: $m_1, \ldots, m_K$ are the counts per class and $m_{1k}, \ldots, m_{Dk}$ count the kernel allocations of class-$k$ patterns. The full conditionals of the $d$-th kernel variances and of the hyperparameter $\beta_i$ contain $i$ as index of the input dimension; there we express each $\sigma_{d,i}^{-2}$ separately. In the expression of the $d$-th kernel mean, $\underline{\mu}_d$, we use $V_d$ to denote the entire covariance matrix.

$$
\begin{aligned}
p(c_m \mid \ldots) &\sim \mathrm{Mn}\!\left(1, \left\{ \frac{P_k\, p(\underline{x}_m \mid \Theta_k)}{\sum_{k} P_k\, p(\underline{x}_m \mid \Theta_k)} \right\}_{k=1..K}\right) \\
p(d_n \mid \ldots) &\sim \mathrm{Mn}\!\left(1, \left\{ \frac{w_{k_n d}\, p(\underline{x}_n \mid \Phi_d)}{\sum_{d} w_{k_n d}\, p(\underline{x}_n \mid \Phi_d)} \right\}_{d=1..D}\right) \\
p(\sigma_{d,i}^{-2} \mid \ldots) &\sim \Gamma\!\left(\alpha + \tfrac{n_d}{2},\; \beta_i + \tfrac{1}{2} \sum_{\forall n \mid d_n = d} (x_{n,i} - \mu_{d,i})^2\right) \\
p(\beta_i \mid \ldots) &\sim \Gamma\!\left(g + D\alpha,\; h_i + \sum_{d} \sigma_{d,i}^{-2}\right) \\
p(\underline{w}_k \mid \ldots) &\sim \mathcal{D}(\delta_w + m_{1k}, \ldots, \delta_w + m_{Dk}) \\
p(\underline{P} \mid \ldots) &\sim \mathcal{D}(\delta_P + m_1, \ldots, \delta_P + m_K) \\
p(\underline{\mu}_d \mid \ldots) &\sim \mathrm{N}\!\left((n_d V_d^{-1} + \kappa)^{-1}(n_d V_d^{-1}\, \underline{\bar{x}}_d + \kappa\, \underline{\xi}),\; (n_d V_d^{-1} + \kappa)^{-1}\right)
\end{aligned} \tag{3}
$$

with $\underline{\bar{x}}_d$ denoting the mean of the samples currently allocated to kernel $d$.
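As an illustration of the conjugate updates in (3), the following sketch (our own code with a simplified signature, not the author's implementation) draws the mean and the diagonal variances of one kernel given the samples currently allocated to it.

```python
# Illustrative sketch (not from the paper): conjugate Gibbs updates for one kernel.
import numpy as np

def gibbs_update_kernel_d(X_d, sigma2_d, alpha, beta, xi, kappa, rng):
    """One Gibbs update of the d-th kernel mean and inverse variances,
    following the Normal and Gamma full conditionals in (3).

    X_d: (n_d, I) samples currently allocated to kernel d.
    sigma2_d: (I,) current diagonal variances of kernel d.
    beta: (I,) current hyperparameters beta_i.
    xi, kappa: prior mean (I,) and diagonal prior inverse covariance (I, I).
    """
    n_d, I = X_d.shape
    V_inv = np.diag(1.0 / sigma2_d)                      # V_d^{-1}
    # Mean: N((n_d V^-1 + kappa)^-1 (n_d V^-1 xbar + kappa xi), (n_d V^-1 + kappa)^-1)
    post_prec = n_d * V_inv + kappa
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (n_d * V_inv @ X_d.mean(axis=0) + kappa @ xi)
    mu_d = rng.multivariate_normal(post_mean, post_cov)
    # Inverse variances: Gamma(alpha + n_d/2, beta_i + 0.5 * sum_n (x_ni - mu_di)^2)
    shape = alpha + 0.5 * n_d
    rate = beta + 0.5 * ((X_d - mu_d) ** 2).sum(axis=0)
    sigma2_d_new = 1.0 / rng.gamma(shape, 1.0 / rate)    # numpy uses scale = 1/rate
    return mu_d, sigma2_d_new

rng = np.random.default_rng(2)
X_d = rng.normal(loc=[1.0, -2.0], scale=0.5, size=(30, 2))
mu, s2 = gibbs_update_kernel_d(X_d, np.ones(2), alpha=1.5, beta=np.ones(2),
                               xi=np.zeros(2), kappa=np.eye(2), rng=rng)
print(mu, s2)
```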

2.3 Moving between different input subsets

The core part of this sampler are the reversible jump updates, where we move between different feature subsets. The probability of a feature subset will be determined by the corresponding Bayesian model evidence and by an additional prior over the number of inputs. In accordance with [7], we use the truncated Poisson prior

$$p(I) = \binom{I_{\max}}{I}^{-1} c\, \frac{\lambda^{I}}{I!},$$

where $c$ is a constant, $\lambda$ the Poisson rate and $I_{\max}$ the total number of inputs. Reversible jump updates are generalizations of conventional Metropolis-Hastings updates, where moves are bijections $(x, u) \mapsto (x', u')$. For a thorough treatment we refer to [4]. In order to switch subsets efficiently, we will use two different types of moves. The first consists of a step where we add one input chosen at random and a matching step that removes one randomly chosen input. A second move exchanges two inputs, which allows "tunneling" through low likelihood areas. When adding an input, we have to increase the dimension of all kernel means and diagonal covariances. These coefficients are drawn from their priors. In addition, the move proposes new allocation probabilities in a semi-deterministic way. Assuming the ordering $w_{k,d} \ge w_{k,d+1}$:

$$
\delta_p \sim \mathrm{Beta}(b_a, b_b + 1), \qquad \forall\, d \le D/2: \quad
\begin{cases}
w'_{k, D+1-d} = w_{k, D+1-d} + w_{k,d}\, \delta_p \\
w'_{k, d} = w_{k,d}\, (1 - \delta_p)
\end{cases} \tag{4}
$$

The matching step proposes removing a randomly chosen input. Removing the corresponding kernel coefficients is again combined with a semi-deterministic proposal of new allocation probabilities, which is exactly symmetric to the proposal in (4).
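A minimal sketch of the birth-side weight update (4), in our own code; the Beta parameters b_a and b_b are left as arguments since their settings are not given above.

```python
# Illustrative sketch (not from the paper): semi-deterministic weight proposal of Eq. (4).
import numpy as np

def propose_birth_weights(w_k, b_a=1.0, b_b=1.0, rng=None):
    """Update the kernel allocation probabilities w_k of one class when an
    input is added, following (4).  b_a, b_b are placeholder Beta parameters.

    w_k: (D,) allocation probabilities, sorted so that w_k[d] >= w_k[d+1].
    """
    rng = rng or np.random.default_rng()
    D = len(w_k)
    delta = rng.beta(b_a, b_b + 1)
    w_new = w_k.copy()
    for d in range(D // 2):                       # pairs (d, D+1-d) in 1-based indexing
        w_new[D - 1 - d] = w_k[D - 1 - d] + w_k[d] * delta
        w_new[d] = w_k[d] * (1.0 - delta)
    return w_new                                  # mass is shifted, the sum stays one

w = np.array([0.5, 0.3, 0.15, 0.05])
w_new = propose_birth_weights(w, rng=np.random.default_rng(3))
print(w_new, w_new.sum())
```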


Table 1: Summary of experiments

Data         avg(#)   max(#)   RBF (%, n_a)   MLP (%, n_b)
Ionosphere   4.3      9        (91.5, 11)     (95.5, 4)
Pima         4        7        (78.9, 11)     (79.8, 8)
Wine         4.4      8        (100, 0)       (96.8, 2)

We accept births with probability $\min(1, r)$, where the ratio $r$ contains the likelihood ratio of the models with and without the proposed input, the ratio of the prior $p(I)$ at the proposed and at the current number of inputs, the proposal probabilities of the birth and of the matching death move, and the Gaussian and Gamma prior densities of the newly drawn kernel means and inverse variances.
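To show how such a proposal enters the sampler, here is a generic accept/reject step using the standard reversible jump rule $\min(1, r)$; the helpers `propose_birth` and `log_accept_ratio` are hypothetical placeholders for the model-specific pieces described above, not functions from the paper.

```python
# Illustrative sketch (not from the paper): generic reversible jump accept/reject step.
import numpy as np

def rj_accept(log_ratio, rng):
    """Accept with probability min(1, r), working with log r for stability."""
    return np.log(rng.uniform()) < min(0.0, log_ratio)

def birth_step(state, propose_birth, log_accept_ratio, rng):
    """Generic birth move: propose adding an input, then accept or reject.
    `propose_birth` and `log_accept_ratio` are model-specific callables
    (assumed interfaces for this sketch)."""
    proposal = propose_birth(state, rng)
    if rj_accept(log_accept_ratio(state, proposal), rng):
        return proposal      # move to the larger input subset
    return state             # stay with the current subset

# Toy usage: a 'state' that is just a set of selected input indices.
rng = np.random.default_rng(4)
state = {0, 2}
propose = lambda s, rng: s | {int(rng.integers(0, 10))}
log_ratio = lambda s, p: 0.0     # placeholder ratio; always accepts here
print(birth_step(state, propose, log_ratio, rng))
```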