Adaptive Support Vector Machine for Time-Varying Data Streams using Martingale (On the Detection of Concept Changes in Time-Varying Data Stream by Testing Exchangeability (UAI 2005))

Shen-Shyang Ho and Harry Wechsler
Department of Computer Science, George Mason University
email: {sho, wechsler}@cs.gmu.edu

The Martingale Test

In this paper we propose an efficient adaptive support vector machine (SVM) for time-varying data streams based on the martingale approach [2] and using adiabatic incremental learning [1]. When a new data point is observed, hypothesis testing decides whether any change has occurred. Once a change is detected, historical information about previous data is removed from the memory. The adaptive support vector machine is a one-pass incremental algorithm that

Consider the simple null hypothesis H0 : “no concept change in the data stream” against the alternative H1 : “concept change occurs in the data stream”. The test continues to operate as long as

3. works well for high dimensional, multi-class data streams.

Background

Consider a set of labeled examples $Z = \{z_1, \dots, z_{n-1}\} = \{(x_1, y_1), \dots, (x_{n-1}, y_{n-1})\}$, where $x_i$ is an object and $y_i \in \{-1, 1\}$ is its corresponding label, for $i = 1, 2, \dots, n-1$. Assuming that a new labeled example, $z_n$, is observed, testing exchangeability for the sequence of examples $z_1, z_2, \dots, z_n$ consists of two main steps [2]:

where λ is a positive number. One rejects the null hypothesis when $M_n^{(\epsilon)} \ge \lambda$. Assuming that $\{M_k : 0 \le k < \infty\}$ is a nonnegative martingale, Doob's Maximal Inequality states that for any λ > 0 and 0 ≤ n < ∞,

$$P\left(\max_{k \le n} M_k \ge \lambda\right) \le \frac{1}{\lambda} \qquad (4)$$

This inequality means that it is unlikely for any $M_k$ to have a high value.
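The stopping rule and the bound (4) can be illustrated with a minimal Python sketch. It assumes the power-martingale update $M_n = M_{n-1} \cdot \epsilon p_n^{\epsilon - 1}$ with $M_0 = 1$. The function name `first_alarm`, the floor of `1e-12` on p-values, and the simulation sizes are choices made here for illustration and are not taken from the poster.

```python
import random

def first_alarm(p_values, epsilon=0.92, lam=10.0):
    """Return the 1-based index n at which M_n >= lam first holds, else None."""
    m = 1.0  # initial martingale value M_0 = 1
    for n, p in enumerate(p_values, start=1):
        # Floor p at a tiny positive value (assumption) so p ** (epsilon - 1) is defined.
        m *= epsilon * max(p, 1e-12) ** (epsilon - 1.0)
        if m >= lam:
            return n  # reject H0: a change is declared
    return None  # the test kept operating: 0 < M_n < lam throughout

# Under H0 the p-values are uniform on [0, 1], so the fraction of runs
# that ever reach lam should not exceed 1 / lam, up to sampling noise.
rng = random.Random(0)
runs, length, lam = 500, 200, 10.0
false_alarms = sum(
    first_alarm([rng.random() for _ in range(length)], lam=lam) is not None
    for _ in range(runs)
)
false_alarm_rate = false_alarms / runs
print(false_alarm_rate, "should be at most", 1.0 / lam)
```

Persistently small p-values (for instance a run of values near 0.01) push the product above λ within a few steps, whereas p-values near 1 shrink it, which is the behaviour the test relies on.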

The randomized power martingale (2) is used to approximate the likelihood ratio

$$l_n = \prod_{i=1}^{n} \frac{f(z_i, H_1)}{f(z_i, H_0)} = \frac{f(Z, H_1)}{f(Z, H_0)},$$

also a martingale. We define the new “likelihood ratio” for a particular point $z_i$ as

$$r(z_i) = \frac{f(z_i, H_1)}{f(z_i, H_0)} \approx \frac{\epsilon p_i^{\epsilon}}{p_i}$$


Test for Change Detection: Intuition

Assume that a sequence of data points with a concept change consists of two concatenated data segments, S1 and S2, such that the concepts of S1 and S2 are C1 and C2 respectively, with C1 ≠ C2. Switching a data point zi from S2 to a position in S1 will make the data point stand out in S1. The exchangeability condition is, therefore, violated. By the Kolmogorov-Smirnov Test (KS-Test), the p-values are shown not to be distributed uniformly after the concept changes (Figure 2). The null hypothesis “the p-values output by (1) are uniformly distributed” is rejected at significance level α = 0.05 after a sufficient number of data points are observed. The skewed p-value distribution plays an important role in our martingale test for change detection, as small p-values inflate the martingale values.
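The uniformity check described above can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure used for Figure 2: it computes the one-sample KS statistic against the uniform distribution on [0, 1] and uses the asymptotic Kolmogorov tail series for the p-value. The function name `ks_uniform_test` and the two synthetic samples are hypothetical.

```python
import math

def ks_uniform_test(p_values):
    """One-sample KS test against U[0, 1]: returns (statistic, asymptotic p-value)."""
    xs = sorted(p_values)
    n = len(xs)
    # Largest gap between the empirical CDF and the uniform CDF F(x) = x.
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    t = math.sqrt(n) * d
    # Asymptotic Kolmogorov tail: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 t^2).
    tail = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * t) ** 2) for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, tail))

n = 200
uniform_like = [(i + 0.5) / n for i in range(n)]  # evenly spread over [0, 1]
skewed = [u ** 3 for u in uniform_like]           # mass pushed toward small p-values

d_uniform, pv_uniform = ks_uniform_test(uniform_like)
d_skewed, pv_skewed = ks_uniform_test(skewed)
reject_skewed = pv_skewed < 0.05  # significance level used in the text
print(pv_uniform, pv_skewed, reject_skewed)
```

The skewed sample stands in for p-values observed after a concept change; the evenly spread sample stands in for the exchangeable case.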

References

[1] G. Cauwenberghs and T. Poggio. Incremental support vector machine learning. In Advances in Neural Information Processing Systems 13, pages 409-415. MIT Press, 2000.
[2] V. Vovk, I. Nouretdinov, and A. Gammerman. Testing exchangeability on-line. In T. Fawcett and N. Mishra, editors, ICML, pages 768-775. AAAI Press, 2003.

Figure 4: Categorical high-dimensional data set. Simulated data stream using a modified Nursery data set: the martingale values of the data stream with λ = 8. ∗ represents a detected change point; there is one false alarm. (Horizontal axis: Data Stream.)

Figure 1: The characteristic of the new “likelihood ratio” $r(z_i)$ at a particular point $z_i$, derived from the p-values with ε = 0.92 (curves shown: $f(z_i|H_0)$ and $f(z_i|H_1)$). When $\epsilon p_i^{\epsilon} - p_i = 0$, $p_i = 0.35265$. When $p_i < 0.35265$, $f(z_i, H_1) > f(z_i, H_0)$.
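The crossing point quoted in the caption follows from solving $\epsilon p^{\epsilon} - p = 0$ on (0, 1), which gives $p = \epsilon^{1/(1-\epsilon)}$. A short Python check, with hypothetical function names, reproduces the value for ε = 0.92:

```python
def likelihood_ratio(p, epsilon=0.92):
    """r(z_i) approximated as epsilon * p^epsilon / p, as in the text."""
    return epsilon * p ** epsilon / p

def break_even_p(epsilon=0.92):
    """p-value at which the approximate likelihood ratio equals 1."""
    return epsilon ** (1.0 / (1.0 - epsilon))

p_star = break_even_p(0.92)
print(round(p_star, 5))          # close to the 0.35265 quoted in the caption
print(likelihood_ratio(p_star))  # approximately 1
```

P-values below this point multiply the martingale by a factor above 1, and p-values above it shrink the martingale.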

In the randomized power martingale (2), the $p_i$ are the p-values output by the randomized p-value function V, and the initial martingale value is $M_0^{(\epsilon)} = 1$.

The upper bound for λ is

$$\lambda \le \frac{1 - \beta}{\alpha} \qquad (5)$$

where α is the desired size (significance level) and β is the probability of the type II error. When the alternative hypothesis H1: “concept change occurs in the data stream” is true, the mean delay time, i.e. the expected value of n, is

$$E(n) = \frac{(1 - \beta) \log \lambda}{E(L)} \qquad (6)$$
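A minimal Python sketch of the bound (5) and the mean delay (6) follows. It assumes $L = \log(\epsilon p_i^{\epsilon - 1})$ as in (7) and estimates $E(L)$ by a sample average over p-values observed after a change. The function names and the constant p-values in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def lambda_upper_bound(alpha, beta):
    """Upper bound (5) on the threshold: lambda <= (1 - beta) / alpha."""
    return (1.0 - beta) / alpha

def mean_delay(lam, beta, epsilon, p_values_under_h1):
    """Mean delay (6), with E(L) replaced by a sample mean (assumption)."""
    increments = [
        math.log(epsilon * p ** (epsilon - 1.0)) for p in p_values_under_h1
    ]
    expected_l = sum(increments) / len(increments)
    if expected_l <= 0.0:
        raise ValueError("E(L) must be positive for the martingale to grow")
    return (1.0 - beta) * math.log(lam) / expected_l

bound = lambda_upper_bound(alpha=0.05, beta=0.1)
delay = mean_delay(lam=10.0, beta=0.1, epsilon=0.92,
                   p_values_under_h1=[0.01] * 100)
print(bound, delay)
```

Because (6) is a ratio of two logarithms, the result does not depend on the base of the logarithm as long as the same base is used in both places.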

In the mean delay time (6), the quantity L is

$$L = \log\left(\epsilon p_i^{\epsilon - 1}\right) \qquad (7)$$

Figure 3: Numerical high-dimensional data set. Simulated data stream using Ringnorm and Wavenorm: the martingale values of the data stream using λ = 10. ∗ represents a detected change point; there are four false alarms and one missed detection.

Figure 5: Multi-class high-dimensional data set. Simulated three-digit data stream using the USPS handwritten digit data set: the martingale values of the data stream using λ = 10. ∗ represents a detected change point; there is one false alarm. The delay times are 45, 99, and 62.

Experiments: Adaptive SVM

[Plot for Figure 2: p-values from the KS-Test and skewness along the data stream; annotations: “Mean of uniformly distributed p-values”, “Mean of p-values computed using (2)”, “Change starts here”.]

[Plot for Figure 6, left graph: Percentage of Data Points (%) versus Difference from baseline accuracy (%).]

The Martingale Test as an Approximation to Sequential Probability Ratio Test (SPRT)

A. Extract a p-value $p_n$ for the set $Z \cup \{z_n\}$ from the strangeness measure deduced from a classifier

The randomized p-value of the set $Z \cup \{z_n\}$ is defined as

$$V(Z \cup \{z_n\}, \theta_n) = \frac{\#\{i : \alpha_i > \alpha_n\} + \theta_n \#\{i : \alpha_i = \alpha_n\}}{n} \qquad (1)$$

where $\alpha_i$ is the strangeness measure for $z_i$, $i = 1, 2, \dots, n$, and $\theta_n$ is randomly chosen from [0, 1]. The p-values $p_1, p_2, \dots$ output by the randomized p-value function V are distributed uniformly in [0, 1], provided that the input examples $z_1, z_2, \dots$ are generated by an exchangeable probability distribution in the input space [2].

B. Construct the randomized power martingale

A family of martingales, indexed by ε ∈ [0, 1] and referred to as the randomized power martingale, is defined as

$$M_n^{(\epsilon)} = \prod_{i=1}^{n} \left(\epsilon p_i^{\epsilon - 1}\right) \qquad (2)$$
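Steps A and B can be sketched in Python as follows. The p-value function mirrors (1) and the running product mirrors (2). The toy strangeness (distance to the mean of the examples seen so far) is an assumption made here so that the example is self-contained; the poster derives strangeness from a classifier instead. All function names are hypothetical.

```python
import random

def randomized_p_value(alphas, theta):
    """Equation (1): alphas holds alpha_1..alpha_n, the last entry is the new example."""
    a_n = alphas[-1]
    greater = sum(1 for a in alphas if a > a_n)
    equal = sum(1 for a in alphas if a == a_n)  # includes the new example itself
    return (greater + theta * equal) / len(alphas)

def power_martingale(p_values, epsilon=0.92):
    """Equation (2): running values M_1..M_n, starting from M_0 = 1."""
    m, values = 1.0, []
    for p in p_values:
        m *= epsilon * p ** (epsilon - 1.0)
        values.append(m)
    return values

def sequential_p_values(stream, rng):
    """P-values for a numeric stream under a toy strangeness (assumption)."""
    seen, p_values = [], []
    for x in stream:
        seen.append(x)
        mean = sum(seen) / len(seen)
        alphas = [abs(v - mean) for v in seen]  # toy strangeness, not the SVM-based one
        # Floor at a tiny positive value so the power in (2) stays defined.
        p_values.append(max(randomized_p_value(alphas, rng.random()), 1e-12))
    return p_values

rng = random.Random(0)
stream = [rng.gauss(0.0, 1.0) for _ in range(100)]
martingale_values = power_martingale(sequential_p_values(stream, rng))
print(martingale_values[-1])
```

The strangeness here is a symmetric function of the examples seen so far, which is the property needed for the p-values to be valid under exchangeability.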


2. does not require monitoring the performance of the classifier as data points are streaming, and

$$0 < M_n^{(\epsilon)} < \lambda \qquad (3)$$

1. does not require a sliding window on the data stream,

Experiments: Change Detection

Introduction

Figure 2: An illustration. Data points are observed one by one from the 1st to the 2000th data point, with the concept change starting at the 1001st data point.

Figure 6: Simulated three-digit data stream using the USPS handwritten digit data set. Left graph: the proportion of data points whose difference from the baseline accuracy is less than or equal to some percentage value. Right graph: the mean difference between the baseline accuracy and the accuracy of the adaptive SVM.