Adaptive One-Class Support Vector Machine


Vanessa Gómez-Verdejo, Jerónimo Arenas-García, Member, IEEE, Miguel Lázaro-Gredilla, Member, IEEE, and Ángel Navia-Vázquez, Senior Member, IEEE

Abstract—In this correspondence, we derive an online adaptive one-class support vector machine. The machine structure is updated via growing and pruning mechanisms, and the weights are updated using the structural risk minimization principles underlying support vector machines. Our approach leads to very compact machines compared to other online kernel methods, whose size, unless truncated, grows almost linearly with the number of observed patterns. The proposed method is online in the sense that every pattern is presented only once to the machine and there is no need to store past samples, and adaptive in the sense that it can forget past input patterns and adapt to the new characteristics of the incoming data. Thus, the characterizing properties of our algorithm are compactness, adaptiveness, and real-time processing capabilities, making it especially well suited to solve online novelty detection problems. Regarding algorithm performance, we have carried out experiments in a time series segmentation problem, obtaining favorable results in both accuracy and model complexity with respect to two existing state-of-the-art methods.

Index Terms—Adaptive methods, one-class SVM, online novelty detection.

I. INTRODUCTION

In recent years, there has been an increasing interest in the application of machine learning techniques to the detection of rare or unseen patterns [1]. Sometimes, data representing the "normal behavior" of the system are time-varying, so adaptive novelty detectors are required. This is the case in applications such as intrusion detection [2], audio and speech segmentation [3], [4], or wireless sensor networks [5]. One-class support vector machines (1-SVMs) [6], [7] have been introduced for novelty detection, systematically achieving good results [3], [4], [8]. The underlying idea of 1-SVMs is to model the support of the distribution, i.e., the input region where "normal" data are expected to lie. Traditional 1-SVMs assume a batch formulation, so they cannot be directly applied to online novelty detection. Instead, it is necessary to derive adaptive schemes that update the model in an online fashion, forgetting the old patterns that no longer represent the current behavior of the system.

Many authors have proposed online versions of margin-based algorithms. The first attempts [9]–[11] iteratively update the model by incorporating every training example that is either misclassified or classified with an insufficient margin, thus indefinitely increasing model complexity. Other techniques [12]–[15] try to solve this drawback by also removing data as they become redundant. This procedure is aimed at incremental learning (a fast alternative to batch SVM), and it is not suited to an adaptive scenario in which the solution has to be able to track changes of the data distribution and to forget the oldest patterns.

Manuscript received July 08, 2010; revised October 13, 2010, December 31, 2010, and March 02, 2011; accepted March 02, 2011. Date of publication March 10, 2011; date of current version May 18, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Konstantinos Slavakis. This work was supported in part by MEC Grant TEC2008-02473 (Spanish Government). The authors are with the Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Madrid 28911, Spain (e-mail: vanessa@tsc.uc3m.es; [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TSP.2011.2125961


In [16], an SVM-based adaptive scheme for novelty detection is proposed. This method iteratively compares two 1-SVMs trained using only the data contained in two sliding windows, which respectively precede and follow the present instant. When the resulting machines are very different, a change in the statistics of the time series is likely to have occurred. However, this 1-SVM formulation is not inherently online, since it requires repeated batch training of new machines, thus incurring a high computational cost and introducing a delay in the processing. In [5], an adaptive 1-SVM tailored to the detection of outliers in wireless sensor networks is proposed. This approach resorts to tricks similar to those of [16] to obtain adaptivity, thus incurring the same drawbacks. The updates are, however, faster in this case because a simpler one-class quarter-sphere SVM [17] is used. A proper online and adaptive implementation of 1-SVMs can be found in [18], where the NORMA family of algorithms is presented. These algorithms work by minimizing the traditional SVM functional using stochastic gradient descent updates and include pruning strategies to control the size of the machine.

In this paper we build on the work of [19], which proposes an iterative weighted recursive least squares (IW-RLS) algorithm, to obtain an online adaptive 1-SVM formulation. The new Adaptive One-class Support Vector Machine (AOSVM) algorithm weighs the error associated to each pattern using an exponential window, so that old patterns have a smaller influence in the SVM cost functional. The proposed method is not only adaptive, in the sense that it forgets the past and learns the new behavior of the system, but also online: AOSVM implements a compact solution of the SVM cost functional that can be updated at each time step using only the newly available data and the previous solution. Furthermore, this compact formulation keeps both the computational and memory requirements of the algorithm under the designer's control.

The rest of the paper is organized as follows. Section II introduces the AOSVM algorithm and discusses the issues involved in online SVM implementations. In Section III several time series segmentation problems are considered to compare the behavior of the new proposed method against two state-of-the-art approaches. Finally, in Section IV we present our conclusions and discuss possible future research.

II. ADAPTIVE ONE-CLASS SUPPORT VECTOR MACHINE

A. Adaptive One-Class SVM Problem Description

Given a set of training patterns $\{\mathbf{x}_i\}_{i=1}^{N}$, the standard (batch) one-class support vector machine (1-SVM), proposed by Schölkopf et al. in [6], produces an output which takes positive values in the region where most of the training patterns lie and negative values elsewhere. Following [6], let us define a nonlinear map $\boldsymbol{\phi}(\cdot)$ which projects training data points into a dot product space $\mathcal{F}$ called feature space.¹ Then, the 1-SVM's output is defined by $f(\mathbf{x}) = \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}) - \rho$, where $(\mathbf{w}, \rho)$ defines a hyperplane in feature space separating the coordinate origin from the projections of the training data. The hyperplane is selected so as to maximize the soft margin, as defined by the following convex quadratic programming problem:

$\min_{\mathbf{w},\,\boldsymbol{\xi},\,\rho}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} - \rho + C\sum_{i=1}^{N}\xi_{i}$   (1)

$\text{subject to}\ \ \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_{i}) \geq \rho - \xi_{i},\ \ \xi_{i} \geq 0,\ \ i = 1,\dots,N$   (2)

where $\xi_{i}$ are positive slack variables and $C$ controls the trade-off between structural and empirical risk.

¹In the kernel literature, the selection of $\boldsymbol{\phi}$ is normally reformulated as the selection of the kernel function $k(\mathbf{x}, \mathbf{x}') = \langle \boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}')\rangle$, i.e., the inner product in $\mathcal{F}$. Some frequent nonlinear kernels are the Gaussian, polynomial, and hyperbolic tangent ones [20].
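For readers who want a concrete point of reference for the batch problem (1)-(2), the following minimal sketch (ours, not part of the original correspondence) trains a standard one-class SVM with a Gaussian kernel using scikit-learn's ν-parameterization; the data, ν, and γ values are arbitrary placeholders.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))          # samples of the "normal behavior"

# nu and gamma are placeholder values; nu plays a role analogous to the
# structural/empirical risk trade-off discussed above.
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

X_test = np.array([[0.1, -0.2],              # close to the training support
                   [6.0, 6.0]])              # far away, should be flagged as novel
print(clf.decision_function(X_test))         # positive inside the support, negative outside
print(clf.predict(X_test))                   # +1 for "normal", -1 for novelty
```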

In many applications, data become available on a one-at-a-time basis, producing a time-dependent training data set $\{\mathbf{x}_i\}_{i=1}^{t}$. Therefore, a good algorithm should be able to include new information without retraining from scratch every time a new pattern appears. Furthermore, the algorithm should be adaptive, i.e., able to track the changes of the optimal solution, not only learning from new data, but also forgetting the past patterns that no longer reflect the present behavior of the system.

The proposed AOSVM algorithm incorporates an exponential weighting window in the empirical risk term of (1), so that it is able to track slow changes in the distribution of the patterns that represent the "normal behavior" of the system, while still detecting abrupt changes in the distribution of the data. Concretely, our approach for online 1-SVM consists of solving at time $t$ the following optimization problem:

$\min_{\mathbf{w}_t,\,\boldsymbol{\xi}_t,\,\rho_t}\ \tfrac{1}{2}\|\mathbf{w}_t\|^{2} - \rho_t + C\sum_{i=1}^{t}\lambda^{t-i}\xi_{i,t}$   (3)

$\text{subject to}\ \ \mathbf{w}_t^{T}\boldsymbol{\phi}(\mathbf{x}_{i}) \geq \rho_t - \xi_{i,t},\ \ \xi_{i,t} \geq 0,\ \ i = 1,\dots,t$   (4)

Note that, since the training data and the cost function are time dependent, so are the values of the slack variables associated to the different patterns. Parameter $\lambda$ is a forgetting factor (usually very close to 1) that weighs the influence of the different patterns in the empirical risk minimization term of (3) using an exponentially decaying window, so that old patterns have less influence.

Equations (3) and (4) clearly determine the 1-SVM problem statement, but to solve it we first have to use Lagrange multipliers to incorporate the constraints into the objective function, arriving at the modified functional

$L_t = \tfrac{1}{2}\|\mathbf{w}_t\|^{2} - \rho_t + C\sum_{i=1}^{t}\lambda^{t-i}\xi_{i,t} - \sum_{i=1}^{t}\alpha_{i,t}\left(\mathbf{w}_t^{T}\boldsymbol{\phi}(\mathbf{x}_{i}) - \rho_t + \xi_{i,t}\right) - \sum_{i=1}^{t}\mu_{i,t}\,\xi_{i,t}$   (5)

which has to be minimized with respect to the primal variables $\mathbf{w}_t$, $\boldsymbol{\xi}_t$, and $\rho_t$ and maximized with respect to the Lagrange multipliers $\alpha_{i,t}$ and $\mu_{i,t}$, with constraints $\alpha_{i,t} \geq 0$ and $\mu_{i,t} \geq 0$. From the theory of Lagrange multipliers, we know that the following set of equalities, known as the Karush–Kuhn–Tucker (KKT) conditions, are the necessary and sufficient conditions that completely specify the solution to the problem:

$\partial L_t / \partial \mathbf{w}_t = 0 \ \Rightarrow\ \mathbf{w}_t = \sum_{i=1}^{t}\alpha_{i,t}\,\boldsymbol{\phi}(\mathbf{x}_{i})$   (6)

$\partial L_t / \partial \rho_t = 0 \ \Rightarrow\ \sum_{i=1}^{t}\alpha_{i,t} = 1$   (7)

$\partial L_t / \partial \xi_{i,t} = 0 \ \Rightarrow\ C\lambda^{t-i} - \alpha_{i,t} - \mu_{i,t} = 0$   (8)

$\alpha_{i,t}\left(\mathbf{w}_t^{T}\boldsymbol{\phi}(\mathbf{x}_{i}) - \rho_t + \xi_{i,t}\right) = 0, \qquad \mu_{i,t}\,\xi_{i,t} = 0$   (9)

Introducing (6) and (7) back into (5), a convex maximization problem involving only the dual variables $\alpha_{i,t}$ is obtained. This corresponds to the standard formulation of the 1-SVM and can be easily solved using quadratic programming. Unfortunately, this approach is unfeasible in an adaptive scenario due to its high computational cost. The next subsection presents an alternative algorithm to solve the 1-SVM problem more efficiently, which will be named AOSVM and can be considered an extension to the 1-SVM case of the approach presented in [19].
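To make the effect of the forgetting factor in (3) concrete, the short sketch below (ours, not taken from the paper) computes the exponentially decaying weights λ^(t−i) applied to the slack of each past pattern; λ = 0.99 is an arbitrary illustrative value.

```python
import numpy as np

lam = 0.99                      # forgetting factor, close to 1 (illustrative value)
t = 200                         # current time index
i = np.arange(1, t + 1)
weights = lam ** (t - i)        # weight of the slack term of pattern i at time t

# Recent patterns keep a weight close to 1, while the oldest ones are mostly forgotten.
print(weights[-1], weights[0])  # 1.0 for the newest pattern, about 0.14 for the oldest
print(weights.sum())            # effective memory, approaching 1 / (1 - lam) = 100 as t grows
```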


TABLE I PSEUDOCODE FOR THE ADAPTIVE ONE-CLASS SVM ALGORITHM (AOSVM)


B. Iterated Re-Weighted Least Squares Solution to the AOSVM

The optimization method based on iterated re-weighted least squares (IRWLS) is a well-established technique for training support vector machines that we have been developing during the last years [19], [21]–[23]. To derive the IRWLS solution for this particular problem, let us first regroup the terms in (5) as follows:

(10)

where we have defined the nonnegative constants $a_{i,t}$, with associated errors $e_{i,t} = \rho_t - \mathbf{w}_t^{T}\boldsymbol{\phi}(\mathbf{x}_i)$. The nonnegativity of the $a_{i,t}$ is easily checked: for $e_{i,t} > 0$ it follows from $\alpha_{i,t} \geq 0$, while for $e_{i,t} < 0$ we have from (8) that $\alpha_{i,t} = 0$, so that in this case $a_{i,t} = 0$ and, consequently, $a_{i,t} \geq 0$ for every pattern. It may seem that $a_{i,t}$ is ill-defined when $e_{i,t} = 0$; however, this is a well-studied situation and the particular and detailed equations needed to implement the IRWLS algorithm (circumventing the problem of division by zero) have already been derived in [19] and cannot be reproduced here due to space restrictions. Nevertheless, the algorithm description in Table I includes the correct expressions to compute the $a_{i,t}$, avoiding such ill-definition.

From KKT condition (6), we know that the last term in (10) vanishes at the solution. Actually, this term is forced to be zero at every iteration of the algorithm, since KKT condition (6) is included in the definition of the $a_{i,t}$ in Table I (a more extensive explanation of the details of the algorithm can be found in [19], [21], and [22], and a convergence proof is given in [23]). The minimization of (10) is just a regularized weighted least squares problem that can be reformulated using matrix notation as

(11)

where $\boldsymbol{\Phi}_t = [\boldsymbol{\phi}(\mathbf{x}_1), \dots, \boldsymbol{\phi}(\mathbf{x}_t)]^{T}$, $\mathbf{D}_{a,t} = \mathrm{diag}(a_{1,t}, \dots, a_{t,t})$, and $\mathbf{1}$ is an all-ones column vector.

C. A Compact AOSVM Formulation

From the Representer Theorem [20], we know that the solution of (11) can be written as an expansion in terms of the projected training samples, $\mathbf{w}_t = \sum_{i=1}^{t}\alpha_{i,t}\,\boldsymbol{\phi}(\mathbf{x}_i)$, so that the 1-SVM function becomes

$f_t(\mathbf{x}) = \sum_{i=1}^{t}\alpha_{i,t}\,k(\mathbf{x}_i, \mathbf{x}) - \rho_t$   (12)

where $k(\cdot,\cdot)$ is the kernel function used to compute inner products in the transformed space $\mathcal{F}$. Then, (12) can be computed using as many kernel evaluations as the number of nonzero $\alpha_{i,t}$. This is undesirable for online applications, in which the number of training data grows at each time step. To overcome the problem of unbounded size, we propose to use a compact approximation of $\mathbf{w}_t$, namely $\mathbf{w}_t \approx \sum_{j=1}^{R}\beta_{j,t}\,\boldsymbol{\phi}(\mathbf{c}_j)$, where $\{\boldsymbol{\phi}(\mathbf{c}_j)\}_{j=1}^{R}$ is a set of vectors in the feature space. The introduction of this compact approximation allows us to rewrite the novelty detection function as $f_t(\mathbf{x}) = \sum_{j=1}^{R}\beta_{j,t}\,k(\mathbf{c}_j, \mathbf{x}) - \rho_t$, so that the evaluation of $f_t(\mathbf{x})$ requires only $R$ kernels.² Then, the AOSVM problem can be solved with this compact approximation by defining the kernel matrices $\mathbf{K}_{ct}$ (between the training patterns and the base vectors) and $\mathbf{K}_{cc}$ (among the base vectors) and replacing $\mathbf{w}_t$ in (11) by its compact expansion, arriving at

(13)

where $\mathbf{K}_{ct}$ and $\mathbf{K}_{cc}$ are kernel matrices. To solve for $\boldsymbol{\beta}_t$ and $\rho_t$, we take derivatives of (13) and use the fact that the gradient should cancel at the solution, leading to

(14)

The selection of the vectors $\mathbf{c}_j$ should be carried out so that the compact expansion is a good approximation of $\mathbf{w}_t$ or, in other words, the space spanned by the $\boldsymbol{\phi}(\mathbf{c}_j)$ should be as close as possible to that spanned by the projections of the SVs. In classical applications, it is possible to use selection strategies based on any clustering algorithm, or more sophisticated methods [19], [28]. In Section II-E, we will discuss how to address this issue in an adaptive scenario.

²When using a Gaussian kernel, the resulting topology is similar to that of an RBF network. Nevertheless, it is important to note that AOSVM implements a maximum margin criterion and that any kernel satisfying Mercer's theorem [20] could be used in our method. Kernels without a radial structure, such as the polynomial kernel, can also be employed if a specific application demands it; see, for instance, [24]–[27].
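The practical payoff of the compact formulation is that the novelty score only needs $R$ kernel evaluations. The sketch below is our own illustration of the resulting decision function $f_t(\mathbf{x}) = \sum_j \beta_{j,t} k(\mathbf{c}_j, \mathbf{x}) - \rho_t$; the base vectors, coefficients, and Gaussian kernel width are placeholders rather than values produced by AOSVM.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def compact_score(x, bases, beta, rho, gamma=0.5):
    """Compact machine output: sum_j beta_j * k(c_j, x) - rho.
    Only len(bases) kernel evaluations are needed, regardless of how many
    patterns have been processed so far."""
    return sum(b * gaussian_kernel(c, x, gamma) for c, b in zip(bases, beta)) - rho

# Placeholder model with R = 3 base vectors in a 2-D input space
bases = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 0.5])]
beta = [0.6, 0.3, 0.4]
rho = 0.2

print(compact_score(np.array([0.1, 0.1]), bases, beta, rho))   # near the bases: clearly positive
print(compact_score(np.array([5.0, 5.0]), bases, beta, rho))   # far away: close to -rho
```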

D. Online Adaptation of the AOSVM Solution

In this paper we are interested in algorithms that can be adapted online, in the sense that we can obtain the solution at time $t$ using only the solution at time $t-1$ and the new pattern. However, computing the solution at time $t$ through (14) requires using, at each step, all training patterns. In this subsection we will reformulate (14) so that AOSVM enjoys the nice property we have just mentioned. To start with, let us define a correlation-type matrix $\mathbf{R}_t$ and a vector $\mathbf{p}_t$ that can be used to rewrite (14) in compact form, so that it is clear that the online computation of both $\mathbf{R}_t$ and $\mathbf{p}_t$ is needed to get an online AOSVM implementation.

First, $\mathbf{R}_t$ is just a modified version of the autocorrelation matrix of the training data. Expanding terms, we get

(15)

where $\mathbf{k}_t = [k(\mathbf{c}_1, \mathbf{x}_t), \dots, k(\mathbf{c}_R, \mathbf{x}_t)]^{T}$. Under the assumption that the solution provided by our algorithm varies slowly (which is a reasonable assumption in slowly varying scenarios), we have $a_{i,t} \approx a_{i,t-1}$, so that it is also possible to approximate

(16)


where the second equality makes use of KKT conditions (7) and (9). Introducing (16) into (15), we can update $\mathbf{R}_t$ recursively from $\mathbf{R}_{t-1}$ and the new pattern. Finally, using the matrix inversion lemma and knowing that both $\mathbf{R}_t$ and its inverse, $\mathbf{R}_t^{-1}$, are symmetric, we have

(17)

Note that AOSVM requires $\mathbf{R}_t^{-1}$ to be known; the estimate can be modified to an approximate one by adding zero-mean Gaussian noise with a given covariance matrix to the data. Secondly, to obtain an expression for adapting $\mathbf{p}_t$ we will apply again the assumption about a slowly changing solution. Then, we can approximate

(18)
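The update in (17) avoids any explicit matrix inversion at run time. The following generic sketch (ours) applies the matrix inversion lemma under the assumption that the correlation-type matrix follows the usual exponentially weighted rank-one recursion R_t = λ R_{t-1} + a_t k_t k_t^T; the dimensions and numerical values are placeholders.

```python
import numpy as np

def rls_inverse_update(R_inv, k, a, lam):
    """Rank-one update of R_inv assuming R_t = lam * R_{t-1} + a * k k^T.
    Implements the matrix inversion (Sherman-Morrison) lemma, so the cost per
    step is O(R^2) instead of the O(R^3) of a full inversion."""
    Rk = R_inv @ k
    return (R_inv - np.outer(Rk, Rk) * (a / (lam + a * (k @ Rk)))) / lam

# Numerical sanity check against a direct inversion (placeholder values)
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
R_prev = A @ A.T + np.eye(4)                 # symmetric positive definite "previous" matrix
k = rng.normal(size=4)
a, lam = 0.7, 0.99

R_new = lam * R_prev + a * np.outer(k, k)
print(np.allclose(rls_inverse_update(np.linalg.inv(R_prev), k, a, lam),
                  np.linalg.inv(R_new)))     # True
```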

E. Semiparametric Model Growing and Pruning Method

In the previous section, we have assumed that the semiparametric model was known. In the following we will explain how to dynamically estimate and update its structure. As we discussed in Section II-C, the compact machine can accurately approximate the optimal weights if the subspaces spanned by the SVs and by the centroids are similar. Therefore, we will consider a set of vectors $\{\mathbf{c}_j\}_{j=1}^{R}$ a sufficient base at time step $t$ if the approximation of $\boldsymbol{\phi}(\mathbf{x}_t)$ as a linear combination of the elements in the base incurs an error below a threshold $\varepsilon$ [28], i.e.,

which can be rewritten as

$k(\mathbf{x}_t, \mathbf{x}_t) - \mathbf{k}_t^{T}\mathbf{K}_{cc}^{-1}\mathbf{k}_t \leq \varepsilon$   (19)

The optimal representation coefficients can be calculated as $\boldsymbol{\gamma}_t = \mathbf{K}_{cc}^{-1}\mathbf{k}_t$ and the corresponding representation error as $\delta_t = k(\mathbf{x}_t, \mathbf{x}_t) - \mathbf{k}_t^{T}\boldsymbol{\gamma}_t$. If $\delta_t > \varepsilon$, then $\mathbf{x}_t$ cannot be correctly represented as a function of the already existing elements in the base, and it has to be incorporated into the base so that it can be represented in the model.

As more and more elements are added to the base, it may well happen that some of the elements of the base become under-used. For economy purposes, it is important to identify and remove them from the base, since they become useless and represent a computational overhead. We propose to compute the normalized accumulated optimal projection coefficients up to time $t$, $\boldsymbol{\gamma}_i$ being the optimal coefficients for representing pattern $\mathbf{x}_i$. To identify the less relevant base elements, we just evaluate whether any element of this accumulated vector is below a given pruning threshold and eliminate it from the base. If such is the case, we will also have to modify $\mathbf{R}_t^{-1}$, $\mathbf{p}_t$, and the kernel matrices by removing the rows and/or columns associated to the pruned element.

Finally, to avoid outliers becoming part of the model, we will not immediately incorporate patterns into the base, but rather we will store them in a "bag" of potential candidates. As new patterns are presented, we compute their projection onto the base and the "bag of candidates", and we build the accumulated coefficient vector for both base elements and candidates in the bag. If, after some steps, the accumulated value of one of the patterns in the bag is above the threshold, it means that it is really a pattern that deserves to be incorporated into the base. This way, we do not prevent relevant patterns from being incorporated into the base and, on the other hand, we avoid the structural instabilities caused by adding to the model new base elements (extra dimensions) that would be pruned away shortly afterwards.

With all this in mind, we summarize in Table I the pseudocode for the online implementation of the AOSVM algorithm. Note that the definition of the $a_{i,t}$ includes regularization to avoid division by zero (see [19] for further details).
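A minimal sketch of the growing and pruning tests just described (our illustration, not the pseudocode of Table I). It assumes the base sufficiency test of [28], i.e., the kernel-space projection error of (19); the Gaussian kernel width, the thresholds, and the handling of the "bag of candidates" are simplified placeholders.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def projection_error(x, bases, gamma=0.5):
    """Error of approximating phi(x) with the projected base vectors, as in (19)."""
    k_t = np.array([gaussian_kernel(c, x, gamma) for c in bases])
    K_cc = np.array([[gaussian_kernel(ci, cj, gamma) for cj in bases] for ci in bases])
    coeffs = np.linalg.solve(K_cc, k_t)               # optimal representation coefficients
    return gaussian_kernel(x, x, gamma) - k_t @ coeffs, coeffs

bases = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]  # current base (placeholder)
usage = np.zeros(len(bases))                          # accumulated projection coefficients
eps_grow, eps_prune = 1e-2, 1e-3                      # placeholder thresholds

# Growing test for a new pattern x_t (the "bag of candidates" stage is omitted here)
x_t = np.array([3.0, -2.0])
err, coeffs = projection_error(x_t, bases)
if err > eps_grow:
    bases.append(x_t)                                 # pattern cannot be represented: add it
    usage = np.append(usage, 0.0)
else:
    usage += np.abs(coeffs) / (np.abs(coeffs).sum() + 1e-12)

# Pruning test: base elements whose accumulated usage stays negligible are candidates
# for removal (in the full algorithm this is only evaluated after many time steps).
prune_candidates = [j for j, u in enumerate(usage) if u < eps_prune]
print(len(bases), prune_candidates)
```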

III. EXPERIMENTS

In the following, we evaluate the performance of the proposed method in a time series segmentation setup, representative of a larger set of novelty detection scenarios. We have selected several time series for experimentation purposes: the first one is the result of concatenating two different stochastic processes ("Two-processes"), the second comprises two sine waves with different frequencies ("Two sines"), the third one is analogous to the second, but the second sine wave has been replaced by a Gaussian noise component ("Noise in sine"), and the fourth one is the well-known laser time series [8] ("Laser").

The novelty detection algorithm is fed with a preprocessed version of the input series. The preprocessing for each series has been selected as follows: for the first three problems, "Two-processes", "Two sines" and "Noise in sine", the linear prediction coefficients (order 5) of the series have been used; for the laser series, we have instead used a simple maximum-minimum envelope (at every time instant two outputs are computed: a maximum and a minimum value on a sliding window). For representational purposes, we have also normalized the output of the AOSVM machine to the range (0,1) using a sigmoid. Then, values close to 1 represent little novelty in the incoming data, whereas values close to 0 mean a high degree of novelty; in this way, a novel pattern is detected when the output is below a given threshold.

In this section, we compare the performance of AOSVM against two other methods:

1) NORMA: A kernel approach taken from [18]. This algorithm is nonparametric in nature, therefore producing very large machines. As described in [18], a truncated version of NORMA can be obtained by retaining only the most recent terms of the expansion. We use the truncated NORMA in our experiments, selecting the truncation parameter so as to produce no noticeable performance loss. A complexity comparison with the standard NORMA is also included.

2) SDEM ("Sequentially Discounting EM algorithm"): A semiparametric algorithm proposed in [29]. This method models data using a Gaussian mixture model and iteratively updates the model parameters with an exponentially decaying window. The outcome of this method depends on the initialization of the Gaussian centroids, so we have repeated each experiment 1000 times with different random initializations and analyzed the possible resulting behaviors.

All algorithms have been tested in the same conditions, using the same pre- and post-processing, and their parameters, for each time series, have been selected according to the following process. We have run a number of experiments with some synthetic time series, generating different time series for validation and test, to identify the range of parameters that lead to reasonable solutions, and we have selected hyperparameters in the middle of such range.
We have observed that the proposed method is rather robust with respect to hyperparameter selection: a modification of the hyperparameters manifests as a larger or smaller model, a steeper response, or reduced noise in the output, but a reasonable response identifying the series change is produced in any case, so it is not necessary to tune the hyperparameters with high precision.
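The post-processing just described is straightforward: the raw machine output is squashed into (0, 1) with a sigmoid and a change is declared whenever it falls below a threshold. The sketch below (ours) uses synthetic placeholder scores and an arbitrary threshold.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic raw scores: large for "normal" patterns, with a dip at the change point
raw_scores = np.concatenate([np.full(100, 2.0), [-3.0, -2.5], np.full(100, 2.0)])
output = sigmoid(raw_scores)       # close to 1: little novelty; close to 0: high novelty

threshold = 0.5                    # placeholder detection threshold
change_points = np.where(output < threshold)[0]
print(change_points)               # indices flagged as novel, here [100 101]
```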


Fig. 1. (a) “Two-processes” time series; (b) AOSVM output; (c) NORMA output; (d), (e), and (f) SDEM output in 40.2%, 35.5%, and 24.3% of the experiments, respectively.

Results on the “Two-processes” series are shown in Fig. 1(b)–(f) for the AOSVM, NORMA, and SDEM algorithms, respectively. SDEM output is given in log scale for representational purposes. At the bottom


Fig. 2. (a) “Two-sines” time series; (b) AOSVM output; (c) NORMA output; (d), (e), and (f) SDEM output in 35.3%, 53.5%, and 11.2% of the experiments, respectively.

of every figure we have plotted small spikes indicating the optimal transition points. We can check that AOSVM is able to correctly detect the process change with a clear spike. Note that the variation


Fig. 3. (a) “Noise-in-sine” time series; (b) AOSVM output; (c) NORMA output; and (d) SDEM output.

of the output at the process change is very large, so that threshold selection is not critical. Note also how, after the change, AOSVM forgets past values of the time series and constructs a new model for its new normal behavior. On the contrary, NORMA not only fails to detect the change, but it also behaves abnormally during the first part of the series, so that no useful threshold can be set for detection in this case. Regarding SDEM, we have observed three different behaviors depending on initialization. In 40.2% of the 1000 experiments the method correctly detected the transition, although the observed output was noisier than for AOSVM. For 35.5% of the initializations a false positive detection was produced, and in the remaining runs no transition point was detected. Fig. 1(d)–(f) shows a typical run of each of these possible outcomes.


Fig. 4. (a) “Laser” time series; (b) AOSVM output; (c) NORMA output; and (d) SDEM output.

In the "Two sines" case, the results for AOSVM and NORMA are presented in Fig. 2(b) and (c). We observe again that AOSVM responds with two sharp peaks, while NORMA produces a much noisier profile with less marked peaks. The fact that the output of the NORMA algorithm is noisier than that of AOSVM may be due to NORMA being a stochastic gradient descent based method, while AOSVM is obtained through a recursive least squares (RLS) formulation. Again, SDEM is highly dependent on the initialization. Both transitions are correctly detected only in 35.3% of the cases (and even those typically include false positives). A single transition point was identified in 53.5% of the runs and no transitions at all in the remaining 11.2%. The typical outputs for each of these cases are shown in Fig. 2(d)–(f).

For the "Noise in sine" time series (Fig. 3), all methods achieve reasonably good performance, but NORMA and SDEM show noisier profiles that can result in some wrong change detections. For this problem,


SDEM output is less sensitive to the initialization parameters and no significant differences between runs were observed. Similar conclusions can be reached in the "Laser" case (Fig. 4): we can see that AOSVM produces again much clearer peaks than NORMA, thus simplifying threshold selection. SDEM was unable to detect the transition point.

TABLE II
AVERAGE MACHINE SIZE FOR NORMA, TRUNCATED NORMA, AND AOSVM ON EVERY TIME SERIES

In order to compare model complexity, we provide in Table II the average size attained by AOSVM, NORMA, and truncated NORMA on the considered data sets. We observe that the direct application of NORMA leads to very large machines, using almost one support vector per sample. Using a truncation parameter results in more reasonable values (around 40 or 45 support vectors on average), without decreasing performance. On the other hand, the AOSVM procedure yields much more compact results, with only three or four kernel computations being required on average.

In short, the performance of these algorithms allows us to conclude that they indeed enjoy adaptive properties, learning the new behavior of the time series. The AOSVM RLS formulation generally produces a higher quality output, simplifying the selection of the novelty detection threshold. Its compact formulation implies that the machine complexity is bounded, thereby reducing the computational and memory requirements. The NORMA and SDEM algorithms, on the other hand, produce noisier outputs, thus being likely to incur false positives. Additionally, SDEM performance critically depends on the initial centroid selection.

IV. CONCLUSIONS AND FUTURE RESEARCH

We have presented a semiparametric Adaptive One-class Support Vector Machine (AOSVM) algorithm and illustrated its performance in an online novelty detection setting for the task of time-series segmentation. The proposed method is characterized by its online and adaptive nature, as well as its bounded memory requirements. Experimental results show its capability to detect changes in a time series scenario, presenting smoother outputs than the NORMA and SDEM approaches. Besides, the compact AOSVM formulation produces much smaller machines than NORMA (two orders of magnitude smaller with respect to the basic NORMA and one order of magnitude with respect to the truncated NORMA). As further work we propose to apply the AOSVM method to more realistic novelty detection application scenarios.

REFERENCES

[1] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, pp. 1–58, 2009.
[2] D. Snyder, "Online intrusion detection using sequences of system calls," Ph.D. dissertation, Comput. Sci. Dept., Florida State Univ., Tallahassee, FL, 2001.
[3] M. Davy and S. Godsill, "Detection of abrupt spectral changes using support vector machines—An application to audio signal segmentation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2002, pp. 1313–1316.
[4] A. Gretton and F. Desobry, "On-line one-class support vector machines. An application to signal segmentation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Hong Kong, 2003, vol. 2, pp. 709–712.
[5] Y. Zhang, N. Meratnia, and P. Havinga, "Adaptive and online one-class support vector machine-based outlier detection techniques for wireless sensor networks," in Proc. Adv. Inf. Networking Appl. (AINA), 2009, pp. 990–995.
[6] B. Schölkopf, R. Williamson, A. J. Smola, J. Shawe-Taylor, and J. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems, vol. 12. Cambridge, MA: MIT Press, 2000.
[7] D. M. J. Tax and R. P. W. Duin, "Support vector data description," Mach. Learn., vol. 54, no. 1, pp. 45–66, 2004.
[8] J. Ma and S. Perkins, "Time-series novelty detection using one-class support vector machines," in Proc. Int. Joint Conf. Neural Networks (IJCNN), Portland, OR, 2003, pp. 1741–1745.
[9] Y. Li and P. M. Long, "The relaxed online maximum margin algorithm," Mach. Learn., vol. 46, no. 1–3, pp. 361–387, 2002.
[10] C. Gentile, "A new approximate maximal margin classification algorithm," J. Mach. Learn. Res., vol. 2, pp. 213–242, 2001.
[11] K. Crammer and Y. Singer, "Ultraconservative online algorithms for multiclass problems," J. Mach. Learn. Res., vol. 3, pp. 951–991, 2003.
[12] A. Bordes, S. Ertekin, J. Weston, and L. Bottou, "Fast kernel classifiers with online and active learning," J. Mach. Learn. Res., vol. 6, pp. 1579–1619, 2005.
[13] A. Bordes and L. Bottou, "The Huller: A simple and efficient online SVM," in Machine Learning: ECML 2005, Lecture Notes in Artificial Intelligence, vol. 3720, 2005, pp. 505–512.
[14] P. Laskov, C. Gehl, S. Krüger, and K.-R. Müller, "Incremental support vector learning: Analysis, implementation and applications," J. Mach. Learn. Res., vol. 7, pp. 1909–1936, 2006.
[15] S. Shalev-Shwartz, Y. Singer, and N. Srebro, "Pegasos: Primal estimated sub-GrAdient solver for SVM," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 807–814.
[16] F. Desobry, M. Davy, and C. Doncarli, "An online kernel change detection algorithm," IEEE Trans. Signal Process., vol. 53, no. 8, pp. 2961–2974, Aug. 2005.
[17] P. Laskov, C. Schäfer, and I. Kotenko, "Intrusion detection in unlabeled data with quarter-sphere support vector machines," in Proc. Detect. Intrusions Malware Vulnerability Assess. (DIMVA), 2004, pp. 71–82.
[18] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online learning with kernels," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176, Aug. 2004.
[19] A. Navia-Vázquez, F. Pérez-Cruz, A. Artés-Rodríguez, and A. R. Figueiras-Vidal, "Weighted least squares training of support vector classifiers leading to compact and adaptive schemes," IEEE Trans. Neural Netw., vol. 12, no. 5, pp. 1047–1059, 2001.
[20] B. Schölkopf and A. J. Smola, Learning With Kernels. Cambridge, MA: MIT Press, 1998.
[21] F. Pérez-Cruz, P. Alarcón-Diana, A. Navia-Vázquez, and A. Artés-Rodríguez, "Fast training of support vector classifiers," Adv. Neural Inf. Process. Syst., vol. 13, pp. 734–740, 2001.
[22] E. Parrado-Hernández, I. Mora-Jiménez, J. Arenas-García, A. R. Figueiras-Vidal, and A. Navia-Vázquez, "Growing support vector classifiers with controlled complexity," Pattern Recognit., vol. 36, no. 7, pp. 1479–1488, 2003.
[23] F. Pérez-Cruz, C. Bousoño-Calzón, and A. Artés-Rodríguez, "Convergence of the IRWLS procedure to the support vector machine solution," Neural Comput., vol. 17, pp. 7–18, 2005.
[24] H. Hoffmann, "Kernel PCA for novelty detection," Pattern Recognit., vol. 40, no. 3, pp. 863–874, 2007.
[25] H. Escalante and O. Fuentes, "Kernel methods for anomaly detection and noise elimination," in Proc. Int. Conf. Comput. (CORE), 2006, pp. 69–80.
[26] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola, "Input space versus feature space in kernel-based methods," IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 1000–1017, 1999.
[27] B. Schölkopf, A. Smola, and K.-R. Müller, "Kernel principal component analysis," in Proc. Int. Conf. Artificial Neural Netw. (ICANN), 1997, vol. 1327, pp. 583–588.
[28] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, Aug. 2004.
[29] K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne, "On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms," in Proc. 6th ACM SIGKDD, 2000, pp. 320–324.