Novelty Detection Employing an L2 Optimal Nonparametric Density Estimator

Chao He, Mark Girolami 1

Bioinformatics Research Centre, Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, United Kingdom

Abstract

This paper considers the application of a recently proposed L2 optimal nonparametric Reduced Set Density Estimator to novelty detection and binary classification, and provides empirical comparisons with other forms of density estimation as well as Support Vector Machines.

Key words: Reduced Set Density Estimator (RSDE); Novelty detection; Binary classification

1 Introduction

Novelty Detection (Roberts, 2000; Schölkopf et al., 2001; Campbell and Bennett, 2001), One-Class Classification (Tax and Duin, 1999) or Outlier Detection (Wang et al., 1997; Barnett and Lewis, 1977; Anderson, 1958) is a problem of significant theoretical (Schölkopf et al., 2001) and practical interest. The parametric statistical approaches to outlier detection, which impose the strong assumption that the data have a Gaussian distribution, are hypothesis tests based on a form of $T^2$-statistic (Anderson, 1958; Barnett and Lewis, 1977). By relaxing the Gaussian assumption, semi-parametric approaches have been proposed which combine mixture modelling, Extreme Value Theory (Roberts, 2000) and the Bootstrap (Wang et al., 1997) to define tests of the outlying nature of a newly observed datum. Methods for outlier testing based on the support vector method have also been proposed and shown to be highly effective for cases where there is a paucity of data drawn from the underlying

1 Corresponding author. Email addresses: [email protected] (C. He), [email protected] (M. Girolami).

Preprint submitted to Elsevier Science, 6 April 2004

distribution and/or an associated density function which may not necessarily exist (Tax and Duin, 1999; Schölkopf et al., 2001; Campbell and Bennett, 2001). This paper considers the case where data scarcity is not a constraint and the continuous distributional characteristics of the data suggest the existence of a well-formed density function. Such situations are the norm in many real applications, such as continuous monitoring of the condition of a machine or process; indeed the reverse 'problem' is often experienced, with an overwhelming amount of data logged.

In situations where the volume of data to be processed is large, a semi-parametric mixture model can provide a reduced representation of the reference data sample, in the form of the estimated mixing coefficients and component sufficient statistics, for testing further observed data for novelty. On the other hand, non-parametric approaches such as the K-nearest neighbour or Parzen window density estimators require the full reference set for testing, which in such practical circumstances can be prohibitively expensive. The support vector approach to novelty detection and density estimation has also been observed to provide a sparse or condensed density representation (Vapnik and Mukherjee, 2000; Tax and Duin, 1999; Schölkopf et al., 2001).

A recently proposed Reduced Set Density Estimator (RSDE) (Girolami and He, 2003) addresses the above problem by providing a kernel density estimator which employs a small subset of the available data sample. It is optimal in the L2 sense in that the integrated squared error between the unknown true density and the RSDE is minimised in devising the estimator. Whilst only requiring $O(N^2)$ optimisation routines to estimate the required kernel weighting coefficients, the RSDE provides similar levels of accuracy and sparseness of representation as Support Vector Machine (SVM) density estimation (Vapnik and Mukherjee, 2000), which requires $O(N^3)$ optimisation routines. An additional advantage of the RSDE is that no extra free parameters are introduced, such as regularisation terms (Weston et al., 1999), bin width (Holmström, 2000; Scott and Sheather, 1985) or number of nearest neighbours (Mitra et al., 2002), making it a very simple and straightforward way to obtain a reduced set density estimator with accuracy comparable to that of the full-sample Parzen density estimator.

Given these advantages for density estimation, it is of interest to investigate the performance the RSDE achieves when applied to novelty detection and classification (the latter term is used throughout to indicate two-class or binary classification). After introducing the RSDE density estimation method (Girolami and He, 2003) in Sec. 2, a statistical hypothesis testing based novelty (outlier) detector and a Bayesian classifier are devised in Sec. 3 and Sec. 4 respectively. The performance of both the RSDE novelty detector and the RSDE classifier is assessed in Sec. 5.

Experimental results indicate that the RSDE based novelty detector and binary classifier achieve statistically similar accuracy to those based on the full-sample Parzen window density estimator while reducing the computational cost of testing by 65% to 80% on average.

2 Reduced Set Density Estimator

2.1 L2 Distance Based Density Estimation

Based on a data sample $S = \{x_1, \cdots, x_N\} \subset \mathbb{R}^d$ the general form of a kernel density estimator is given as $\hat{p}(x; h, \gamma) = \sum_{n=1}^{N} \gamma_n K_h(x, x_n)$. For a given kernel with width h the maximum likelihood estimation (MLE) criterion (McLachlan and Peel, 2000) can be employed to estimate the weighting coefficients subject to the constraints $\sum_n \gamma_n = 1$ and $\gamma_n \geq 0 \ \forall n$, which yields values for the coefficients such that $\gamma_n = \frac{1}{N} \ \forall x_n \in S$, i.e. the Parzen window density estimator (Girolami and He, 2003).

Alternative distance-based criteria have been considered for the purposes of density estimation when employing mixture models (Scott, 1999). In particular the L2 criterion, based on the Integrated Squared Error (ISE), has been investigated as a robust error criterion which is less influenced by the presence of outliers in the sample and by model mismatch than the MLE criterion (Scott, 1999). For a density estimate with parameters θ denoted as $\hat{p}(x; \theta)$, the argument which provides the minimum ISE, defined as $\int_{\mathbb{R}^d} |p(x) - \hat{p}(x; \theta)|^2 dx$, can be written as follows.

$$\hat{\theta} = \arg\min_{\theta} \int_{\mathbb{R}^d} \hat{p}^2(x; \theta)\, dx - 2 E_{p(x)}\{\hat{p}(x; \theta)\} \qquad (1)$$

Where $E_{p(x)}\{\cdot\}$ denotes expectation with respect to the unknown density p(x).

2.2 Plug-In Estimation of Weighting Coefficients

An unbiased estimate of the right-hand expectation in (1) can be obtained as a $\gamma_i$-weighted sum of full Parzen density estimates $\hat{p}_h(x_i)$ at each point $x_i$. The left-hand term $\int_{\mathbb{R}^d} \hat{p}^2(x; \theta)\, dx$ can be computed in a quadratic form $\sum_{i,j=1}^{N} \gamma_i \gamma_j C(x_i, x_j)$ where $C(x_i, x_j) = \int_{\mathbb{R}^d} K_h(x, x_i) K_h(x, x_j)\, dx$. Combining both, the minimisation of a plug-in estimate of the ISE (1) for a kernel density estimator $\hat{p}(x; \theta) = \hat{p}(x; h, \gamma)$ can be written as a constrained quadratic

optimisation (refer to Girolami and He (2003) for further details) which in familiar matrix form is

$$\arg\min_{\gamma} \; \frac{1}{2}\gamma^T C \gamma - \gamma^T p \quad \text{subject to} \quad \gamma^T 1 = 1 \ \text{and} \ \gamma_i \geq 0 \ \forall i \qquad (2)$$

Where the N × N matrices with elements $C(x_i, x_j)$ and $K_h(x_i, x_j)$ are defined as C and K respectively. The N × 1 vector of Parzen density estimates at each point in the sample, $\hat{p}_h(x_i) = \frac{1}{N}\sum_{j=1}^{N} K_h(x_i, x_j)$, is defined as $p = K 1_N$, where $1_N$ is the N × 1 vector whose elements all equal $\frac{1}{N}$. The above minimisation of a plug-in estimate of the ISE for a general kernel density estimator yields a sparse representation in the weighting coefficients (refer to Girolami and He (2003) for a detailed discussion). The minimisation specified by (2) can be solved by standard constrained quadratic programming, which exhibits $O(N^3)$ scaling. Girolami and He (2003) proposed appropriate forms of the multiplicative updating (Sha et al., 2002) and Sequential Minimal Optimisation (SMO) (Schölkopf et al., 2001) methods for (2), which further reduce the scaling to $O(N^2)$. The RSDE was shown to provide a sparse representation and to reduce the computational cost of full Parzen window density estimation without degrading accuracy (Girolami and He, 2003). In the following sections, the RSDE based novelty detection and binary classification approaches are devised.
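To make the optimisation concrete, the following is a minimal Python sketch of solving (2) for a Gaussian kernel, for which the convolution identity $C(x_i, x_j) = K_{\sqrt{2}h}(x_i, x_j)$ holds. The SLSQP solver and the small zero-weight threshold are illustrative choices rather than the multiplicative or SMO updates of Girolami and He (2003), so this is a sketch, not the reference implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def gaussian_kernel(X, Y, h):
    """Isotropic Gaussian kernel matrix K_h(x_i, y_j)."""
    d = X.shape[1]
    sq = cdist(X, Y, 'sqeuclidean')
    return np.exp(-sq / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (d / 2.0)

def rsde_weights(X, h):
    """Solve (2): min_g 0.5 g'Cg - g'p  s.t.  g >= 0, sum(g) = 1.
    For a Gaussian kernel, C(x_i, x_j) = K_{h*sqrt(2)}(x_i, x_j)."""
    N = X.shape[0]
    K = gaussian_kernel(X, X, h)
    C = gaussian_kernel(X, X, h * np.sqrt(2.0))
    p = K.mean(axis=1)                          # Parzen estimate at each sample
    res = minimize(lambda g: 0.5 * g @ C @ g - g @ p,
                   np.full(N, 1.0 / N),         # start from the Parzen weights
                   jac=lambda g: C @ g - p,
                   bounds=[(0.0, None)] * N,
                   constraints={'type': 'eq', 'fun': lambda g: g.sum() - 1.0},
                   method='SLSQP')
    g = np.where(res.x < 1e-8, 0.0, res.x)      # most weights collapse to zero
    return g / g.sum()
```

The nonzero entries of γ identify the reduced set, so evaluating the density at a test point costs only as many kernel evaluations as there are retained samples.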

3 Hypothesis Testing and Novelty Detection

Novelty Detection is, in most situations, characterised as a problem of identifying examples from a data sample which are possibly discordant with the existing sample (Barnett and Lewis, 1977). Such a problem can be posed as a statistical hypothesis test in the following manner: a finite sample of instances of a random vector $x \in \mathbb{R}^d$ has a probability density such that the sample $\{x_1, \cdots, x_N\}$ is independently and identically distributed as $p(x; \theta)$, where θ denotes the parameters of the appropriate distributional form. For an additional example $x_{N+1}$ the null hypothesis is that sample N + 1 is drawn from the same distribution, i.e. $x_{N+1} \sim p(x; \theta)$. The alternate hypothesis is that the example is drawn from another distribution characterised by, for example, a different location parameter. In the absence of any prior information about the alternate distribution, it should be noted that an α-level significance test will define a region, say $C^d$, bounded by a constant level of the density $p(x; \theta)$. The alternate distribution will therefore have finite support, and the least committal (maximum-entropy) distributional form is the uniform distribution over the region of support. The alternate hypothesis is written as $x_{N+1} \sim U_S$, where $U_S$ denotes the uniform distribution over the region of support $S^d = \mathbb{R}^d \setminus C^d$. Formally, for a given data sample $x_1, \cdots, x_N$ the following test for a new point $x_{N+1}$ is given: $H_0: x_{N+1} \in C^d$ vs. $H_1: x_{N+1} \notin C^d$. The standard test statistic employed is the likelihood ratio (Anderson, 1958)

$$\lambda = \frac{\sup_{\theta_{N+1}} \prod_{n=1}^{N+1} p(x_n; \theta)}{\sup_{\theta_N} U_S \times \prod_{n=1}^{N} p(x_n; \theta)}. \qquad (3)$$

For the case where $p(x; \theta)$ is multivariate normal with mean μ and covariance C, whose N-sample estimates are $\hat{\mu}_N$ and $\hat{C}_N$, the test statistic emerging from the above likelihood ratio takes the form of a modified $T^2$-statistic (Anderson, 1958). Considering an additional (N + 1)'th example denoted $x_{N+1}$, it can be shown that the associated test statistic $D^2 = \left(\frac{N}{N+1}\right)^2 (x_{N+1} - \hat{\mu}_N)^T \hat{C}_N^{-1} (x_{N+1} - \hat{\mu}_N)$ is related to a central F-distribution with d and N − d degrees of freedom by $F = (N - d)(N + 1)D^2 / [dN^2 - (N + 1)dD^2]$. As the null hypothesis states that all the points including the (N + 1)'th sample have a common mean and covariance, this point is rejected at the α significance level if $F > F_{\alpha; d, N-d}$. The α-level test defines an elliptical region bounded by the value of $D^2$ corresponding to the α significance level given by the estimated parameters $\hat{\mu}_N$ and $\hat{C}_N$. Whilst appealing to the central limit theorem to justify assumptions of data normality provides an elegant closed-form null distribution on which to threshold a novelty (outlier) detector, the strong assumption on the parametric form of the data distribution is often too restrictive in many practical applications (Schölkopf et al., 2001; Tax and Duin, 1999).
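As a worked illustration of this Gaussian case, the sketch below implements the $D^2$ and F statistics as reconstructed above; the use of the unbiased sample covariance is an assumption, since the text does not fix the normalisation of $\hat{C}_N$.

```python
import numpy as np
from scipy.stats import f as f_dist

def t2_outlier_test(X, x_new, alpha=0.05):
    """Modified T^2 test: reject x_new at level alpha when F > F_{alpha; d, N-d}."""
    N, d = X.shape
    mu_hat = X.mean(axis=0)
    C_hat = np.cov(X, rowvar=False)              # unbiased estimate (assumption)
    diff = x_new - mu_hat
    D2 = (N / (N + 1.0)) ** 2 * diff @ np.linalg.solve(C_hat, diff)
    F = (N - d) * (N + 1.0) * D2 / (d * N ** 2 - (N + 1.0) * d * D2)
    return F > f_dist.ppf(1.0 - alpha, d, N - d)
```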

For an arbitrary, non-Gaussian density the likelihood ratio test statistic no longer has a closed-form representation of its distribution under the null hypothesis. However, it should be noted that asymptotically $\lambda \rightarrow p(x_{N+1}|\hat{\theta}_N) + O_p(N^{-1})$, and so the test statistic which emerges is simply the probability density estimate, based on the original N samples, of the test point (Wang et al., 1997). The distribution of the test statistic under the null hypothesis can be estimated employing the bootstrap (Efron and Tibshirani, 1993), and so the critical values $\lambda_{crit}$ which define the specific significance level of the test for the null hypothesis can be established from the numerically obtained empirical distribution (Wang et al., 1997). Employing this test statistic to provide an α-level significance test of data novelty requires an estimate of the probability density $p(x; \theta)$.

In Sec. 5, we employ the RSDE and compare it with the Parzen window density estimation method to provide the required test statistic for a novelty detector. In addition, the Support Vector Data Description method (Tax and Duin, 1999; Schölkopf et al., 2001), which is specifically designed for one-class classification (novelty detection), is also employed for comparison.

4 Bayesian Classification

So far, we have discussed using the RSDE or other density estimators to build a novelty (outlier) detector for one-class classification. In this section the multi-class situation will be considered. Classically, for multi-class classification it is desired to predict the posterior probability of membership of one of the classes given the input x. To obtain a probabilistic classifier with a density estimator we train an estimator $\hat{p}_c(x; \theta) = \hat{p}(x; \theta|c)$ for each class c, and apply Bayes' rule to obtain the posterior probability of class membership

$$\hat{P}(c|x; \theta) = \frac{\hat{p}(x; \theta|c)\hat{P}(c)}{\sum_{c'} \hat{p}(x; \theta|c')\hat{P}(c')}, \qquad (4)$$

then the test sample x is assigned to the class having the largest posterior probability. When applying the RSDE as the estimator to build the classifier, the parameters $\theta = \{h, \gamma\}$ are estimated for each class. During training, the kernel width (free parameter) h is tuned by cross-validation on the classification error of the validation set, and the weighting coefficients γ are obtained by optimising (2) over the training samples. The same procedure is used to choose the h value of the Parzen window estimator.
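A compact sketch of the resulting classifier, reusing the hypothetical gaussian_kernel and rsde_weights helpers from the Sec. 2 sketch and assuming one cross-validated width per class, might read:

```python
import numpy as np

def fit_bayes_classifier(class_samples, widths):
    """Train one RSDE per class; estimate class priors from class sizes."""
    priors = np.array([len(X) for X in class_samples], dtype=float)
    priors /= priors.sum()
    weights = [rsde_weights(X, h) for X, h in zip(class_samples, widths)]
    return class_samples, widths, weights, priors

def predict(x, model):
    """Assign x to the class with the largest posterior (4); the common
    denominator in (4) does not affect the argmax and is dropped."""
    class_samples, widths, weights, priors = model
    scores = [(gaussian_kernel(x[None, :], X, h) @ w)[0] * P
              for X, h, w, P in zip(class_samples, widths, weights, priors)]
    return int(np.argmax(scores))
```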

5 Experiments

5.1 Novelty Detection Experiments

In this section the RSDE and the Parzen window (PW) non-parametric density estimators are employed to provide the required test statistic for the novelty detector defined by (3). The novelty detection results are then compared with the Support Vector Data Description (SVDD) method, which was specifically designed for one-class classification (novelty detection) (Tax and Duin, 1999).

The distribution of the test statistic under the null hypothesis is obtained by using each individual bootstrap sample to define the N-sample reference data set which defines the density estimate; a further single (N + 1)'th datum is then drawn from the available data sample and the test statistic for the bootstrap sample, $\hat{p}_b(x_{N+1}; \hat{\theta})$, is computed.
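A sketch of this procedure, again built on the hypothetical helpers from the Sec. 2 sketch; the exact resampling scheme is our reading of Wang et al. (1997) rather than a verbatim transcription:

```python
import numpy as np

def novelty_threshold(X, h, alpha=0.05, B=1000, seed=0):
    """Estimate lambda_crit as the alpha-quantile of the bootstrapped null
    distribution of the density estimate at a held-out (N+1)'th datum."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    stats = np.empty(B)
    for b in range(B):
        ref = X[rng.integers(0, N, size=N)]    # bootstrap N-sample reference set
        x_new = X[rng.integers(0, N)]          # further single (N+1)'th datum
        w = rsde_weights(ref, h)
        stats[b] = (gaussian_kernel(x_new[None, :], ref, h) @ w)[0]
    return np.quantile(stats, alpha)

def is_novel(x, X, w, h, lam_crit):
    """Reject H0 (flag x as novel) when its density falls below lambda_crit."""
    return (gaussian_kernel(x[None, :], X, h) @ w)[0] < lam_crit
```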

5.1.1 Handwritten Digits Dataset

In the handwritten digits dataset (the Multiple Features Dataset: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/mfeat/) there are 200 examples of each of the digits 0 – 9, and six different feature sets are available: Fourier (76-dimensional), Profile (216-dimensional), Karhunen-Loève (64-dimensional), Pixel (240-dimensional), Zernike (47-dimensional) and Morphological (6-dimensional). In (Tax and Duin, 1999) the Fourier, Profile, Zernike and Pixel feature sets were investigated individually. In the following experiment, the Fourier, Zernike and Morphological feature sets are combined to create a single feature set with higher sample diversity than each individual one. The digits 0 – 9 in turn are chosen to be the 'normal class' against which all other digits are measured for novelty. Repeating the approach taken in (Tax and Duin, 1999), the 200 samples of the target object are split into a training set and a testing set (for evaluating the false rejection performance of the novelty detector) of 100 samples each. The remaining 1800 samples of all other digits are used as outlier testing objects (for evaluating the false acceptance performance of the novelty detector). As only 100 target objects are available for parameter estimation, the 5-dimensional principal component subspace, which retains over 90% of the variance in the data, is used in this experiment.

The RSDE and the Parzen window density estimators were utilised to estimate the density of the feature set for the target digit. The null distribution of the test statistic was then obtained by a 1000-step bootstrap (Wang et al., 1997). The kernel width for the Parzen window density estimate was selected by 10-fold cross-validation; the kernel width for the RSDE was selected by 10-fold estimates of the integrated squared error against the Parzen window estimator. By setting thresholds (false rejection rates) to define significance level tests of 1%, 5%, 10%, 15% and 25%, the integrated Receiver Operating Characteristic (ROC) errors (Metz, 1978; Tax and Duin, 1999) are calculated and shown in Tab. 1, in which RR indicates the RSDE's remaining sample ratio, i.e. the percentage of the size of the reduced set obtained by the RSDE over the size of the original reference set used by the full Parzen window estimator. The last row of the table shows the performance of each method averaged over all 10 target classes.
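The integrated ROC error can be read as the false acceptance rate averaged over the stated range of false rejection levels; the sketch below encodes that reading, with the grid of levels an assumption rather than the exact construction of Metz (1978) or Tax and Duin (1999).

```python
import numpy as np

def integrated_roc_error(target_scores, outlier_scores, lo=0.01, hi=0.25, steps=50):
    """Average outlier acceptance over target rejection rates in [lo, hi]."""
    fr = np.linspace(lo, hi, steps)                  # false rejection levels
    thresholds = np.quantile(target_scores, fr)      # matching density thresholds
    fa = np.array([(outlier_scores >= t).mean() for t in thresholds])
    return fa.mean()                                 # mean false acceptance
```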

The results of the one-class classifier SVDD are also listed in the table for comparison.

Table 1
Integrated ROC errors (1% – 25%) of the novelty detection tests on the handwritten digits dataset.

Class      RSDE (%)   PW (%)   SVDD (%)   RR (%)
0          3.23       3.44     2.14       4
1          1.90       5.44     10.93      75
2          2.19       6.68     4.06       13
3          3.52       10.37    6.94       59
4          3.28       8.20     8.59       30
5          2.69       7.44     2.94       20
6          5.19       9.01     7.71       22
7          0.82       2.17     8.32       65
8          1.44       2.05     5.46       70
9          5.14       9.13     7.89       11
Average    2.94       6.39     6.50       36.9

In Tab. 1, the RSDE novelty detector shows the best performance on this single split for almost all 10 target classes, and outperforms the Parzen window estimator for every target class, even though the Parzen window was reported to provide the best overall performance for well sampled data in (Tax and Duin, 1999) after extensive testing of various 'one-class classifiers' on a large number of diverse data sets. On average, the RSDE achieves better performance than the Parzen window and the SVDD on this data set while reducing the sample size by approximately 63%; only about 37% of the target sample is employed in the novelty assessments of all other samples. The RSDE has previously been shown to produce smoother density estimates than the Parzen window density estimator, which can over-fit (Girolami and He, 2003) and thereby incur a larger test error. The performance of the SVDD is poorer because no estimate of a density is made. These results give only some indication of the significance of the measured performance; as they carry no indication of the variability in the performance figures, a more extensive study is presented in Sec. 5.1.2 where statistical tests are utilised to give more complete comparisons.

5.1.2 R¨atsch’s Real Dataset Collections Further extensive experiments were executed on a subset of R¨atsch’s real dataset collections which included the data sets Banana, Titanic, Diabetes, Breast Cancer and German. Utilised in R¨atsch et al. (2001), these data sets are available at http://ida.first.gmd.de/∼raetsch, which provides 100 8

training and test splits of two classes for each data set. In our experiments, the first 10 splits were used. Within each split, N1R training samples from the first class were utilised as the reference set to train the RSDE, the Parzen Window and the SVDD novelty detectors; N1V test samples from the same class were utilised as the validation set to evaluate the false rejection performance; and all N2 training and test samples from the second class were measured for novelty. Experimental results are shown in Tab. 2, where integrated ROC errors (1%–25%) of three novelty detectors over 10 splits are displayed by their mean and standard deviation (STD) values. As with the previous set of results the RSDE performance appears to be superior on average. Table 2 Novelty detection results of R¨atsch’s data collections. Integrated ROC Error (Mean ± STD)%

RR(%)

Data set

N1R

N1V

N2

d

RSDE

PW

SVDD

(Mean ± STD)

Banana

183

2741

2376

2

4.46±1.39

5.60±0.88

12.21±1.97

13.52±2.18

Titanic

107

1383

711

3

7.46±2.12

7.89±4.65

7.92±2.49

66.15±6.88

Diabetes

298

202

268

8

10.15±3.18

14.56±4.06

10.88±2.24

2.21±0.88

Breast Cancer

142

54

81

9

10.01±3.49

13.81±4.21

10.85±4.18

11.18±1.27

German

483

217

300

20 a

12.64±5.05

13.80±3.57

13.28±2.92

6.28±0.90

a

In experiments, the 20-dimensional data set German was reduced to 7-dimensional by carrying out the Kolmogorov-Smirnov test to remove those dimensions without discrimination powers.

The distribution of test errors is Gaussian (assessed using the Jarque-Bera test (Judge et al., 1988) at α = 0.01), so a T-test is utilised to compare performance between the different novelty detectors. Although in Tab. 2 the RSDE novelty detector continues to show the lowest mean test errors on all data sets among the three methods investigated, the T-test results indicate no significant difference (α = 0.01) between the performances of the RSDE and the Parzen window approaches on any of the five data sets. However, on average the sample size utilised for tests by the RSDE approach is only 19.87% of that utilised by the Parzen window approach, i.e. the RSDE novelty detector reduces the test computational costs by approximately 80%. Apart from Banana, there is also no statistically significant difference (α = 0.01) between the performances of the RSDE and the SVDD approaches. Plotting the Banana data set (whose reference set is distributed approximately in a ring shape) reveals a large number of outliers located inside the boundary defined by the support points obtained by the SVDD, which results in a very high false acceptance rate. Because the SVDD only defines a closed boundary around the reference set rather than estimating the density distribution, it has limitations for data sets with particular distributions such as Banana. Furthermore, unlike the RSDE approach, whose sample points used for the test and whose sample reduction rates are fixed across all significance level tests, those of the SVDD are variable and depend on the significance levels set by the tests. Therefore, whenever the significance level changes during testing, the SVDD has to be rerun, which can produce unwanted inconvenience in controlling test levels.
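A sketch of this comparison protocol over per-split errors; whether the paper's T-test was paired is not stated, so a paired test over the shared splits is assumed here.

```python
from scipy import stats

def significantly_different(errors_a, errors_b, alpha=0.01):
    """Jarque-Bera normality check, then a two-sided paired T-test over splits."""
    for e in (errors_a, errors_b):
        if stats.jarque_bera(e).pvalue < alpha:
            raise ValueError("per-split errors not plausibly Gaussian")
    return stats.ttest_rel(errors_a, errors_b).pvalue < alpha
```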

5.2 Two-class Classification Experiments

It is well known that class-conditional density estimation may not be optimal for the purpose of classification (Vapnik, 1998). However, to make the investigation more complete, this section further assesses the RSDE's performance for binary classification.

5.2.1 Ripley's Synthetic Data


Fig. 1. Left hand plot: RSDE density estimation based classifier, with its reduced set samples encircled; right hand plot: Parzen density estimation based classifier. In both cases the decision boundary is shown as a thick solid line, together with iso-contours of the density estimates for both classes.

To illustrate the classification results visually, Ripley's synthetic 2-D data (Ripley, 1996) are employed to test the performance of the density estimation based two-class classifier specified in Sec. 4. Ripley's data comprise two classes with altogether 250 training samples and 1000 testing samples, generated from mixtures of two Gaussians with the classes overlapping to the extent that the Bayes error is around 8%. In the experiments, the RSDE and the Parzen window density estimators are utilised and compared, with the kernel widths of both selected using the same method as in the novelty detection experiments. The classification results for the training set are shown in Fig. 1. The test error for the RSDE is 9.2%, only slightly better than the 9.4% of the Parzen window, but the RSDE has a remarkable advantage in test complexity: only 13 samples in the reduced set are needed for the RSDE classifier, whereas all 250 training samples are required for the Parzen window classifier.
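For instance, the classifier sketched in Sec. 4 could be exercised on Gaussian-mixture data of this flavour; the means, spreads and kernel widths below are illustrative stand-ins, not Ripley's actual generating parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([-0.3, 0.4], 0.25, size=(125, 2))   # toy stand-in for class 0
X1 = rng.normal([0.3, 0.6], 0.25, size=(125, 2))    # toy stand-in for class 1
model = fit_bayes_classifier([X0, X1], widths=[0.25, 0.25])
print(predict(np.array([0.0, 0.5]), model))         # predicted class label
```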

5.2.2 Rätsch's Real Dataset Collections

Further experiments with the RSDE classifier were carried out on the subset of Rätsch's real dataset collections used in the novelty detection experiments of Sec. 5.1.2. RSDE classification results are shown in Tab. 3, where the test results of the SVM classifier are quoted from Rätsch's online repository and the Parzen window classifier is employed for comparison. The average numbers of nonzero points (points used for the test) of the Parzen window and RSDE classifiers over 10 splits are also listed.

Table 3
Classification results on the real data sets.

                          Test error (Mean ± STD)%                    Nonzero points
Data set        N     d    RSDE          PW            SVM            RSDE    PW
Banana          400   2    11.18±0.62    10.82±0.48    11.68±0.79     65.7    400
Titanic         150   3    22.75±0.41    22.16±0.42    22.10±0.61     77.3    150
Diabetes        468   8    28.53±1.81    25.80±1.66    23.50±1.49     5.4     468
Breast Cancer   200   9    30.65±4.51    26.49±3.07    28.57±4.29     7.9     200
German          700   20   28.80±1.91    23.80±2.64    22.50±1.41     28      700

As in Sec. 5.1.2, the Jarque-Bera test and T-test are carried out to compare the test errors of all three classifiers. For all data sets, the T-test (α = 0.01) results show no significant difference between the RSDE and the Parzen window classifiers, while the SVM is statistically superior on the Diabetes and German data sets. Meanwhile, on average the RSDE classifier reduces test computational costs by approximately 76%.

6 Discussion

This section offers some intuitive explanations of why the RSDE achieves better performance in novelty detection than in binary classification. In novelty detection, an outlier is tested by setting a threshold on the target density estimate. Because the reduced set obtained by the RSDE represents the target density well, being optimal in the L2 sense with respect to the true distribution, the RSDE based novelty detector achieves good results. In the classification case, the separation between the two classes matters more than exactly how an individual class is distributed. Therefore the SVM classifier, which concentrates on measuring the distance between the two classes, generally achieves better classification performance than finite-sample density estimate based classifiers, including the Parzen window and RSDE classifiers.

7 Conclusion

This paper investigated the application of the recently proposed Reduced Set Density Estimator (RSDE) (Girolami and He, 2003) to novelty detection and binary classification and provided empirical comparisons with the Parzen window and SVM (Vapnik and Mukherjee, 2000; Tax and Duin, 1999) approaches. Experimental results indicate that the RSDE based novelty detector and binary classifier both achieve statistically similar accuracy to those based on the full-size Parzen window density estimator while reducing computational costs for the test by 65% to 80% on average. The RSDE density estimation based novelty detector also outperforms the non-density-estimation based SVDD (Tax and Duin, 1999), which has limitations on some particular data sets and increases test computational cost and inconvenience when performing tests at different significance levels. The distance-based SVM classifier (Vapnik and Mukherjee, 2000) shows slightly superior performance to the two density estimate based classifiers considered.

Acknowledgements

This work is supported by the Scottish Higher Education Funding Council Research Development grant 'INCITE' (http://www.incite.org.uk). A full Matlab implementation of the RSDE and example data sets are available at http://cis.paisley.ac.uk/giro-ci0/reddens.

References

Anderson, T., 1958. An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Barnett, V., Lewis, T., 1977. Outliers in Statistical Data. Wiley, New York.
Campbell, C., Bennett, K., 2001. A linear programming approach to novelty detection. In: T.K. Leen et al., Eds., Advances in Neural Information Processing Systems 13, MIT Press, 395 - 401.
Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Girolami, M., He, C., 2003. Probability density estimation from optimally condensed data samples. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), 1253 - 1264.
Holmström, L., 2000. The error and the computational complexity of a multivariate binned kernel density estimator. Journal of Multivariate Analysis 72(2), 264 - 309.
Judge, G.G., Hill, R.C., Griffiths, W.E., et al., 1988. Introduction to the Theory and Practice of Econometrics. Wiley, New York.
McLachlan, G., Peel, D., 2000. Finite Mixture Models. Wiley, New York.
Metz, C., 1978. Basic principles of ROC analysis. Seminars in Nuclear Medicine 8(4), 283 - 298.
Mitra, P., Murthy, C., Pal, S., 2002. Density based multiscale data condensation. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(6), 734 - 747.
Rätsch, G., Onoda, T., Müller, K., 2001. Soft margins for AdaBoost. Machine Learning 42(3), 287 - 320.
Ripley, B., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK.
Roberts, S., 2000. Extreme value statistics for novelty detection in biomedical signal processing. IEE Proceedings - Science, Measurement and Technology 147(6), 363 - 367.
Schölkopf, B., Platt, J., Shawe-Taylor, J., et al., 2001. Estimating the support of a high-dimensional distribution. Neural Computation 13, 1443 - 1471.
Scott, D., 1999. Remarks on fitting and interpreting mixture models. Computing Science and Statistics 31, 104 - 109.
Scott, D., Sheather, S., 1985. Kernel density estimation with binned data. Communications in Statistics - Theory and Methods 14, 1353 - 1359.
Sha, F., Saul, L., Lee, D.D., 2002. Multiplicative Updates for Non-Negative Quadratic Programming in Support Vector Machines. Technical Report MS-CIS-02-19, University of Pennsylvania.
Tax, D., Duin, R., 1999. Support vector data description. Pattern Recognition Letters 20(11-13), 1191 - 1199.
Vapnik, V.N., 1998. Statistical Learning Theory. Wiley, New York.
Vapnik, V., Mukherjee, S., 2000. Support vector method for multivariate density estimation. In: S. Solla et al., Eds., Advances in Neural Information Processing Systems, MIT Press, 659 - 665.
Wang, S., Woodward, W., Gray, H., et al., 1997. A new test for outlier detection from a multivariate mixture distribution. Journal of Computational and Graphical Statistics 6, 285 - 299.
Weston, J., Gammerman, A., Stitson, M.O., et al., 1999. Support vector density estimation. In: Advances in Kernel Methods, MIT Press.
