Loadstar: Load Shedding in Data Stream Mining Yun Chi¹, Haixun Wang², Philip S. Yu² ¹Department of Computer Science, UCLA ²IBM Thomas J. Watson Research Center
Introduction
- Data stream systems
  - Data from embedded sensors
  - Financial and retailer data
  - Network traffic data
- Resources are limited
  - CPU cycles
  - Bandwidth
  - Memory
Load Shedding: Which to Drop?
- Load shedding
  - Dropping a certain amount of the load
- Which to drop?
  - Randomly
  - Intelligently
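The two options above can be contrasted with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `shed_randomly` and `shed_by_priority`, the `(stream_id, urgency)` workload, and the use of a single priority score as a stand-in for "intelligent" selection. It is not the method the slides go on to propose.

```python
import random

def shed_randomly(items, keep_fraction, rng=None):
    """Random shedding: keep each item independently with probability keep_fraction."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    return [item for item in items if rng.random() < keep_fraction]

def shed_by_priority(items, keep_count, priority):
    """Priority-based shedding: keep the keep_count items with the highest score."""
    ranked = sorted(items, key=priority, reverse=True)
    return ranked[:keep_count]

# Hypothetical workload: (stream_id, urgency) pairs; urgency is an assumed score.
workload = [("s1", 0.9), ("s2", 0.1), ("s3", 0.5), ("s4", 0.7)]

kept_random = shed_randomly(workload, keep_fraction=0.5)
kept_smart = shed_by_priority(workload, keep_count=2, priority=lambda item: item[1])
```

Random shedding needs no knowledge of the data, while the priority-based variant depends entirely on how good the score is.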
Load Shedding: An Example of Temperature Sensors
[Figure: two panels, Case 1 and Case 2, each showing the readings of Sensor A and Sensor B on a temperature axis marked 80, 90, 100.]
Load Shedding in Classifying Multiple Data Streams—Introduction
Our Main Contributions
- A novel Quality of Decision (QoD) measure
  - Discriminant functions
  - Predicted feature distribution
- A feature prediction model based on
  - Markov chains
  - Real-time parameter updates
- Loadstar
Quality of Decision: Discriminant Functions
[Figure: two panels plotted against the feature value. The first, "Discriminant Functions", shows $f_1(x)$ and $f_2(x)$; the second, "Log Ratio of Discriminant Functions", shows their log ratio with the decision boundary marked.]
Quality of Decision: Based on Overall Risk
- Feature distribution in the next time unit: $X \sim p(x)$
- At a point $x$, the conditional risk for $c_i$:
  $R(c_i \mid x) = \sum_{j=1}^{K} \sigma(c_i \mid c_j)\, P(c_j \mid x)$
- The expected risk:
  $E_x[R(c_i \mid x)] = \int_x R(c_i \mid x)\, p(x)\, dx$
- The decision based on expected risk:
  $\delta_2 : k = \arg\min_i E_x[R(c_i \mid x)]$
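As a rough illustration of the conditional risk, the expected risk, and the resulting decision, here is a minimal NumPy sketch. It assumes the feature distribution p(x) has been discretized onto a finite grid, so the integral becomes a weighted sum. The function names, the two-class zero-one loss, and the toy numbers are all hypothetical.

```python
import numpy as np

def conditional_risk(loss, posterior):
    """R(c_i | x_n) = sum_j loss[i, j] * P(c_j | x_n) for every grid point x_n.

    loss:      (K, K) array, loss[i, j] = sigma(c_i | c_j)
    posterior: (N, K) array, posterior[n, j] = P(c_j | x_n)
    returns:   (N, K) array, entry [n, i] = R(c_i | x_n)
    """
    return posterior @ loss.T

def expected_risk(loss, posterior, p_x):
    """E_x[R(c_i | x)], approximated as sum_n R(c_i | x_n) * p_x[n]."""
    return p_x @ conditional_risk(loss, posterior)

def decide(loss, posterior, p_x):
    """Index k of the class with the smallest expected risk."""
    return int(np.argmin(expected_risk(loss, posterior, p_x)))

# Toy setup (assumed): two classes, zero-one loss, three grid points.
loss = 1.0 - np.eye(2)
posterior = np.array([[0.9, 0.1],
                      [0.6, 0.4],
                      [0.2, 0.8]])
p_x = np.array([0.5, 0.3, 0.2])  # predicted feature distribution on the grid

risks = expected_risk(loss, posterior, p_x)
k = decide(loss, posterior, p_x)
```

With these numbers the first class has the lower expected risk, so the decision is made for it without observing the actual feature value.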
Quality of Decision: Based on Overall Risk (continued)
- The Bayesian risk:
  $E_x[R(c^* \mid x)] = \int_x R(c^* \mid x)\, p(x)\, dx$
- The Quality of Decision (QoD):
  $Q_2 = 1 - \bigl( E_x[R(c_k \mid x)] - E_x[R(c^* \mid x)] \bigr) = 1 - \int_x \bigl[ P(c^* \mid x) - P(c_k \mid x) \bigr]\, p(x)\, dx$
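To make the definition concrete, the following NumPy sketch computes the QoD on a discretized feature distribution. The grid approximation, the function name `quality_of_decision`, and the toy numbers are assumptions. The zero-one loss used in the example is also an assumption; under it, R(c_i | x) = 1 − P(c_i | x), which is what makes the two forms of the QoD formula agree.

```python
import numpy as np

def quality_of_decision(loss, posterior, p_x):
    """Return (k, Q2) for a feature distribution discretized on N grid points.

    loss:      (K, K) array, loss[i, j] = sigma(c_i | c_j)
    posterior: (N, K) array, posterior[n, j] = P(c_j | x_n)
    p_x:       (N,) array of probabilities summing to 1
    """
    risk = posterior @ loss.T          # risk[n, i] = R(c_i | x_n)
    expected = p_x @ risk              # expected[i] = E_x[R(c_i | x)]
    k = int(np.argmin(expected))       # decision made without observing x
    bayes = p_x @ risk.min(axis=1)     # E_x[R(c* | x)], c* chosen per grid point
    return k, 1.0 - (expected[k] - bayes)

# Toy setup (assumed): two classes, zero-one loss, two grid points.
loss = 1.0 - np.eye(2)
posterior = np.array([[0.8, 0.2],
                      [0.3, 0.7]])

# Spread-out prediction: the blind decision is wrong on part of the space.
k_spread, q_spread = quality_of_decision(loss, posterior, np.array([0.75, 0.25]))

# Concentrated prediction: the blind decision matches the Bayes decision.
k_sharp, q_sharp = quality_of_decision(loss, posterior, np.array([1.0, 0.0]))
```

The more the predicted distribution concentrates on one side of the decision boundary, the closer the QoD gets to 1, which suggests that stream is a cheaper one to shed.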
Quality of Decision: Based on Overall Risk (continued)
- $0 \le Q_2 \le 1$; the higher $Q_2$ is, the more confident we are.
- $Q_2 = 1$ if and only if $c_k$ is the minimum-risk decision in every region of the feature space.
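The bounds on $Q_2$ can be checked from the definitions. The short derivation below assumes the zero-one loss, which the slides do not state explicitly; under it, $R(c_i \mid x) = 1 - P(c_i \mid x)$.

```latex
\begin{align*}
R(c_k \mid x) - R(c^* \mid x) &\ge 0
  && \text{for all } x, \text{ since } c^* \text{ minimizes the conditional risk at } x, \\
Q_2 = 1 - \int_x \bigl[ R(c_k \mid x) - R(c^* \mid x) \bigr]\, p(x)\, dx &\le 1, \\
R(c_k \mid x) - R(c^* \mid x) = P(c^* \mid x) - P(c_k \mid x) &\le 1
  && \text{(zero-one loss)}, \\
Q_2 \ge 1 - \int_x p(x)\, dx &= 0 .
\end{align*}
```

Equality $Q_2 = 1$ requires the integrand to vanish wherever $p(x) > 0$, that is, $c_k$ must attain the minimum conditional risk there.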
Feature Prediction
- Feature distribution: $x \sim p(x)$
- Take advantage of temporal locality
  - Stock price data
  - Consecutive snapshots from satellites
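One way to exploit temporal locality is to model the feature as a Markov chain and update its parameters as data arrives. The sketch below is an assumption-laden illustration, not the exact model of the talk. It assumes a first-order chain over discretized feature values, transition probabilities estimated from counts, and a pseudo-count prior. The class name `MarkovFeaturePredictor` and the toy sequence are hypothetical.

```python
import numpy as np

class MarkovFeaturePredictor:
    """First-order Markov chain over discretized feature values (an assumed model)."""

    def __init__(self, n_states, prior_count=1.0):
        # Transition counts start at a pseudo-count so that every row is a
        # valid distribution before any data has been observed.
        self.counts = np.full((n_states, n_states), prior_count)

    def update(self, prev_state, next_state):
        """Record one observed transition (real-time parameter update)."""
        self.counts[prev_state, next_state] += 1.0

    def predict(self, current_state, steps=1):
        """Predicted distribution of the state `steps` time units ahead."""
        transition = self.counts / self.counts.sum(axis=1, keepdims=True)
        dist = np.zeros(len(self.counts))
        dist[current_state] = 1.0
        for _ in range(steps):
            dist = dist @ transition
        return dist

# Toy sequence of discretized feature values (assumed data).
sequence = [0, 0, 1, 1, 1, 2, 2, 1]
predictor = MarkovFeaturePredictor(n_states=3)
for prev, nxt in zip(sequence, sequence[1:]):
    predictor.update(prev, nxt)

p_next = predictor.predict(current_state=1)            # one time unit ahead
p_later = predictor.predict(current_state=1, steps=3)  # after two unobserved units
```

The `steps` argument covers the case where a stream was not observed for a few time units, so the prediction has to be propagated further ahead and becomes correspondingly less sharp.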