Adaptive Support Vector Machine for Time-Varying Data Streams using Martingale (On the Detection of Concept Changes in Time-Varying Data Stream by Testing Exchangeability (UAI 2005))
Shen-Shyang Ho and Harry Wechsler
Department of Computer Science, George Mason University
Email: {sho, wechsler}@cs.gmu.edu
The Martingale Test
In this paper we propose an efficient adaptive support vector machine (SVM) for time-varying data streams, based on the martingale approach [2] and adiabatic incremental learning [1]. When a new data point is observed, a hypothesis test decides whether a change has occurred. Once a change is detected, historical information about previous data is removed from memory. The adaptive support vector machine is a one-pass incremental algorithm that
Consider the simple null hypothesis $H_0$: “no concept change in the data stream” against the alternative $H_1$: “concept change occurs in the data stream”. The test continues to operate as long as
3. works well for high-dimensional, multi-class data streams.
Background: Consider a set of labeled examples $Z = \{z_1, \dots, z_{n-1}\} = \{(x_1, y_1), \dots, (x_{n-1}, y_{n-1})\}$, where $x_i$ is an object and $y_i \in \{-1, 1\}$ is its corresponding label, for $i = 1, 2, \dots, n-1$. Assuming that a new labeled example, $z_n$, is observed, testing exchangeability for the sequence of examples $z_1, z_2, \dots, z_n$ consists of two main steps [2]:
where $\lambda$ is a positive number. One rejects the null hypothesis when $M_n^{(\epsilon)} \ge \lambda$. Assuming that $\{M_k : 0 \le k < \infty\}$ is a nonnegative martingale, Doob's Maximal Inequality states that for any $\lambda > 0$ and $0 \le n < \infty$,
$$P\left(\max_{k \le n} M_k \ge \lambda\right) \le \frac{1}{\lambda} \qquad (4)$$
This inequality means that it is unlikely for any $M_k$ to have a high value.
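As a minimal sketch of the stopping rule described here, the following Python function scans a sequence of p-values, multiplies in the factor $\epsilon p_i^{\epsilon-1}$ of the randomized power martingale, and reports a change whenever the value reaches $\lambda$. The function name `detect_changes`, the default parameter values, the small floor on the p-values, and the restart from $M_0 = 1$ after each detection are illustrative assumptions. In the adaptive SVM the stored examples are also discarded at that point, which this sketch does not model.

```python
def detect_changes(p_values, lam=10.0, epsilon=0.92):
    """Return the 0-based indices at which the martingale reaches lam.

    p_values: iterable of p-values in (0, 1], assumed to be precomputed.
    lam:      detection threshold (lambda in the text).
    epsilon:  index of the power martingale, in (0, 1).
    """
    detections = []
    m = 1.0  # initial martingale value M_0 = 1
    for i, p in enumerate(p_values):
        p = max(p, 1e-12)  # assumption of this sketch: avoid p == 0
        m *= epsilon * p ** (epsilon - 1.0)
        if m >= lam:
            detections.append(i)
            m = 1.0  # assumption of this sketch: restart after a detection
    return detections


if __name__ == "__main__":
    print(detect_changes([1.0] * 50))   # large p-values shrink the martingale
    print(detect_changes([0.01] * 20))  # small p-values inflate the martingale
```

Working with the logarithm of the martingale instead of the product is a reasonable alternative for very long streams, since the running product can underflow when no change occurs for a long time.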
The randomized power martingale is used to approximate the likelihood ratio
$$l_n = \prod_{i=1}^{n} \frac{f(z_i, H_1)}{f(z_i, H_0)} = \frac{f(Z, H_1)}{f(Z, H_0)},$$
also a martingale. We defined the new “likelihood ratio” for a particular point $z_i$ as
$$r(z_i) = \frac{f(z_i, H_1)}{f(z_i, H_0)} \approx \frac{\epsilon p_i^{\epsilon}}{p_i}$$
Test for Change Detection: Intuition
Assume that a sequence of data points with a concept change consists of two concatenated data segments, $S_1$ and $S_2$, such that the concepts of $S_1$ and $S_2$ are $C_1$ and $C_2$ respectively, with $C_1 \ne C_2$. Switching a data point $z_i$ from $S_2$ to a position in $S_1$ makes the data point stand out in $S_1$. The exchangeability condition is therefore violated. By the Kolmogorov-Smirnov test (KS-Test), the p-values are shown not to be uniformly distributed after the concept changes (Figure 2). The null hypothesis “the p-values output by (1) are uniformly distributed” is rejected at significance level $\alpha = 0.05$ after a sufficient number of data points are observed. The skewed p-value distribution plays an important role in our martingale test for change detection, as small p-values inflate the martingale values.
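The uniformity check mentioned above can be illustrated with a small one-sample Kolmogorov-Smirnov test against the uniform distribution on $[0, 1]$. This is a sketch under stated assumptions rather than the procedure used for Figure 2: the function names are hypothetical, and the critical value uses the standard large-sample approximation $\sqrt{-\tfrac{1}{2}\ln(\alpha/2)}/\sqrt{n}$ instead of exact tables.

```python
import math


def ks_uniform_statistic(p_values):
    """KS statistic D_n between the empirical CDF and the U[0, 1] CDF."""
    xs = sorted(p_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # Largest gap just after and just before each jump of the empirical CDF.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d


def rejects_uniformity(p_values, alpha=0.05):
    """True if uniformity is rejected at level alpha (asymptotic critical value)."""
    n = len(p_values)
    critical = math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)
    return ks_uniform_statistic(p_values) > critical


if __name__ == "__main__":
    evenly_spread = [(i + 0.5) / 100 for i in range(100)]
    skewed_small = [p * p for p in evenly_spread]  # mass pushed toward 0
    print(rejects_uniformity(evenly_spread), rejects_uniformity(skewed_small))
```

P-values that pile up near zero, as they do after a concept change, produce a large statistic and lead to rejection.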
References
[1] G. Cauwenberghs and T. Poggio. Incremental support vector machine learning. In Advances in Neural Information Processing Systems 13, pages 409-415. MIT Press, 2000.
[2] V. Vovk, I. Nouretdinov, and A. Gammerman. Testing exchangeability on-line. In T. Fawcett and N. Mishra, editors, ICML, pages 768-775. AAAI Press, 2003.
Figure 4: Categorical high-dimensional data set. Simulated data stream using a modified Nursery data set. The martingale values of the data stream (x-axis: Data Stream) with $\lambda = 8$; ∗ represents a detected change point. One false alarm.
[Plot of the “likelihood ratio” $r(z_i)$, with curves $f(z_i|H_0)$ and $f(z_i|H_1)$; the marked point is $p = 0.35265$.]
Figure 1: The characteristic of the new “likelihood ratio” $r(z_i)$ at a particular point $z_i$, derived from the p-values with $\epsilon = 0.92$. When $\epsilon p_i^{\epsilon} - p_i = 0$, $p_i = 0.35265$. When $p_i < 0.35265$, $f(z_i, H_1) > f(z_i, H_0)$.
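The crossover value quoted in the caption can be checked directly: setting $\epsilon p^{\epsilon} - p = 0$ for $p > 0$ gives $p = \epsilon^{1/(1-\epsilon)}$. The short Python sketch below evaluates this closed form and the ratio $\epsilon p^{\epsilon}/p$; the function names are hypothetical and chosen only for illustration.

```python
def likelihood_ratio(p, epsilon=0.92):
    """Approximate ratio r = epsilon * p**epsilon / p for a p-value p in (0, 1]."""
    return epsilon * p ** (epsilon - 1.0)


def crossover_p_value(epsilon=0.92):
    """p-value at which the ratio equals 1, i.e. epsilon * p**epsilon == p."""
    return epsilon ** (1.0 / (1.0 - epsilon))


if __name__ == "__main__":
    p_star = crossover_p_value()
    print(round(p_star, 5))
    print(likelihood_ratio(0.1) > 1.0, likelihood_ratio(0.9) < 1.0)
```

P-values below the crossover contribute a factor greater than 1, and p-values above it contribute a factor less than 1.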
The upper bound for $\lambda$ is
$$\lambda \le \frac{1-\beta}{\alpha} \qquad (5)$$
where $\alpha$ is the desired size and $\beta$ is the probability of the type II error. When the alternative hypothesis $H_1$: “concept change occurs in the data stream” is true, the mean delay time, i.e., the expected value of $n$, is:
$$E(n) = \frac{(1-\beta)\log\lambda}{E(L)} \qquad (6)$$
In the randomized power martingale, the $p_i$ are the p-values output by the randomized p-value function $V$, and the initial martingale value is $M_0^{(\epsilon)} = 1$.
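The two quantities named here can be sketched in a few lines of Python, assuming the strangeness values $\alpha_1, \dots, \alpha_n$ have already been computed by some classifier (that step is not shown). The function names, the list-based interface, and the small floor on the p-values are assumptions made for this illustration.

```python
import random


def randomized_p_value(strangeness, theta=None, rng=random):
    """Randomized p-value of the newest example.

    strangeness: list [alpha_1, ..., alpha_n]; the last entry belongs to the
                 newly observed example z_n.
    theta:       tie-breaking number in [0, 1]; drawn at random if omitted.
    """
    if theta is None:
        theta = rng.random()
    alpha_n = strangeness[-1]
    greater = sum(1 for a in strangeness if a > alpha_n)
    equal = sum(1 for a in strangeness if a == alpha_n)  # counts z_n itself
    return (greater + theta * equal) / len(strangeness)


def power_martingale(p_values, epsilon=0.92):
    """Return [M_1, ..., M_n] with M_0 = 1 and M_k = M_{k-1} * eps * p_k**(eps - 1)."""
    m = 1.0
    values = []
    for p in p_values:
        p = max(p, 1e-12)  # assumption of this sketch: avoid p == 0
        m *= epsilon * p ** (epsilon - 1.0)
        values.append(m)
    return values


if __name__ == "__main__":
    # The newest example is the strangest one, so its p-value is small.
    print(randomized_p_value([0.1, 0.5, 0.3, 0.9], theta=0.5))
    print(power_martingale([0.125, 0.05, 0.02]))
```

A very strange new example receives a small p-value, which in turn multiplies the martingale by a factor larger than 1.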
Figure 3: Numerical high-dimensional data set. Simulated data stream using Ringnorm and Wavenorm. The martingale values of the data stream using $\lambda = 10$; ∗ represents a detected change point. Four false alarms and one missed detection.
Figure 5: Multi-class high-dimensional data set. Simulated three-digit data stream using the USPS handwritten digit data set. The martingale values of the data stream using $\lambda = 10$; ∗ represents a detected change point. One false alarm. The delay times are 45, 99, and 62.
Experiments: Adaptive SVM
where
$$L = \log\left(\epsilon p_i^{\epsilon-1}\right) \qquad (7)$$
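The threshold bound (5) and the mean delay (6) with $L$ as in (7) can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, and $E(L)$ is estimated by the sample mean of $L$ over a list of p-values assumed to have been observed after the change, which is an assumption of this example rather than something stated above.

```python
import math


def lambda_upper_bound(alpha, beta):
    """Upper bound (1 - beta) / alpha for the threshold lambda."""
    return (1.0 - beta) / alpha


def mean_delay(lam, beta, p_values_after_change, epsilon=0.92):
    """Estimate E(n) = (1 - beta) * log(lam) / E(L), with L = log(eps * p**(eps - 1)).

    E(L) is approximated by the sample mean over the supplied p-values
    (assumption of this sketch).
    """
    logs = [math.log(epsilon * p ** (epsilon - 1.0)) for p in p_values_after_change]
    expected_l = sum(logs) / len(logs)
    if expected_l <= 0.0:
        raise ValueError("E(L) must be positive for the delay to be finite")
    return (1.0 - beta) * math.log(lam) / expected_l


if __name__ == "__main__":
    print(lambda_upper_bound(alpha=0.05, beta=0.5))
    print(mean_delay(lam=10.0, beta=0.0, p_values_after_change=[0.01] * 5))
```

Smaller p-values after the change increase $E(L)$ and therefore shorten the estimated delay.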
[Figure 2 plot residue. Labels: P-values from KS-Test; Skewness; mean of uniformly distributed p-values; mean of p-values computed using (2); “Change starts here”.]
[Figure 6 plot residue. Axis labels: Difference from baseline accuracy (%); Percentage of Data Points (%).]
A. Extract a p-value $p_n$ for the set $Z \cup \{z_n\}$ from the strangeness measure deduced from a classifier.
The randomized p-value of the set $Z \cup \{z_n\}$ is defined as
$$V(Z \cup \{z_n\}, \theta_n) = \frac{\#\{i : \alpha_i > \alpha_n\} + \theta_n \#\{i : \alpha_i = \alpha_n\}}{n} \qquad (1)$$
where $\alpha_i$ is the strangeness measure for $z_i$, $i = 1, 2, \dots, n$, and $\theta_n$ is randomly chosen from $[0, 1]$. The p-values $p_1, p_2, \dots$ output by the randomized p-value function $V$ are distributed uniformly in $[0, 1]$, provided that the input examples $z_1, z_2, \dots$ are generated by an exchangeable probability distribution in the input space [2].
B. Construct the randomized power martingale.
A family of martingales, indexed by $\epsilon \in [0, 1]$ and referred to as the randomized power martingale, is defined as
$$M_n^{(\epsilon)} = \prod_{i=1}^{n} \left(\epsilon p_i^{\epsilon-1}\right) \qquad (2)$$
The Martingale Test as an Approximation to Sequential Probability Ratio Test (SPRT)
$$0 < M_n^{(\epsilon)} < \lambda \qquad (3)$$
1. does not require a sliding window on the data stream,
2. does not require monitoring the performance of the classifier as data points are streaming, and
Experiments: Change Detection
Introduction
Figure 2: An illustration. Data points are observed one by one from the 1st to the 2000th data point (x-axis: Data Stream), with the concept change starting at the 1001st data point.
[Plot axis: Data Stream, with ticks at 1831, 3738, and 5461.]
Figure 6: Simulated three-digit data stream using the USPS handwritten digit data set. Left graph: the proportion of data points whose difference from the baseline accuracy is less than or equal to a given percentage value. Right graph: the mean difference between the baseline accuracy and the accuracy of the adaptive SVM.