Data Cleaning Using Belief Propagation
Andrea Fang Chu, Yizhou Wang, D. Stott Parker, Carlo Zaniolo
Department of Computer Science, University of California, Los Angeles
Outline
- Data Quality
- Markov Network
- Belief Propagation
- Applications
Data Quality
Example: incomplete sensor readings
Data Quality
Extreme solution 1: use the marginal probability.
$\hat{x}_1$ is the mean or mode of $\phi(x_1)$.
Data Quality
Extreme solution 2: use the joint probability.
$(\hat{x}_1, \hat{x}_2, \cdots, \hat{x}_8)$ is the mean or the mode of the joint posterior probability.
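To make the two extremes concrete, here is a tiny numerical sketch (with made-up numbers) showing that the marginal mode and the joint MAP can disagree:

```python
import numpy as np

# Hypothetical joint distribution P(x1, x2) over two ternary variables;
# the numbers are illustrative, not from the talk.
P = np.array([[0.18, 0.00, 0.00],
              [0.00, 0.17, 0.17],
              [0.16, 0.16, 0.16]])

# Extreme 1: mode of the marginal phi(x1), ignoring the dependence on x2.
x1_marginal = P.sum(axis=1).argmax()                        # -> 2

# Extreme 2: joint MAP over (x1, x2).
x1_joint, x2_joint = np.unravel_index(P.argmax(), P.shape)  # -> (0, 0)

print(x1_marginal, (x1_joint, x2_joint))  # the two estimates of x1 differ
```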
Data Quality
Exploit local dependencies
Pairwise Markov Networks

- White circles: random variables $\{x_i\}$.
- Gray circles: external evidences (or observations) $\{y_i\}$.
- $\phi(x_i, y_i)$: external potential.
- $\psi(x_i, x_j)$: internal binding.

Factorize the joint probability:

$$P(\vec{x}, \vec{y}) = \frac{1}{Z} \prod_{(i,j)} \psi(x_i, x_j) \prod_i \phi(x_i, y_i)$$
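Here is a minimal sketch of this factorization on a toy three-node chain; the topology, binary state space, and potential tables are illustrative assumptions, not from the slides:

```python
import itertools
import numpy as np

n_states = 2
edges = [(0, 1), (1, 2)]                                      # neighbor pairs (i, j)
psi = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}  # internal binding
phi = [np.array([[2.0, 1.0], [1.0, 2.0]]) for _ in range(3)]  # external potential

def unnormalized_p(x, y):
    """prod_(i,j) psi(x_i, x_j) * prod_i phi(x_i, y_i), without 1/Z."""
    p = 1.0
    for (i, j), table in psi.items():
        p *= table[x[i], x[j]]
    for i, table in enumerate(phi):
        p *= table[x[i], y[i]]
    return p

# With the evidence y fixed, normalizing over x yields P(x | y).
y = (0, 1, 0)
Z = sum(unnormalized_p(x, y) for x in itertools.product(range(n_states), repeat=3))
print(unnormalized_p((0, 0, 0), y) / Z)
```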
Solving a Markov Network

Solving a Markov Network involves two phases:
- Learning phase:
  - Determine the network structure (variables, neighbors);
  - Learn the potential functions: $\phi$'s and $\psi$'s.
- Inference phase:
  - Infer the mean or maximum a posteriori (MAP) estimates of the unknown $x_i$'s, based on the $\phi$'s, $\psi$'s, and the known $y_j$'s.
  - Efficient inference: Belief Propagation.
Belief Propagation (BP)

Basics of Belief Propagation: iterative "message passing" between neighbors (J. Yedidia et al., NIPS 94).

Message update:

$$m_{ij}^{t+1}(x_j) = \sum_{x_i} \phi(x_i, y_i)\,\psi(x_i, x_j) \prod_{k \in N(i),\ k \neq j} m_{ki}^{t}(x_i)$$

Belief (sum-product):

$$b_i^{\mathrm{SUM}}(x_i) \propto \phi(x_i, y_i) \prod_{j \in N(i)} m_{ji}(x_i)$$
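The updates above translate almost line for line into code. Below is a minimal sum-product implementation on a toy three-node chain; the graph, potentials, and fixed evidence are illustrative assumptions:

```python
import numpy as np

n_states = 2
neighbors = {0: [1], 1: [0, 2], 2: [1]}
psi = np.array([[2.0, 1.0], [1.0, 2.0]])   # shared psi(x_i, x_j)
# phi(x_i, y_i) with the evidence y already plugged in:
phi = [np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.2, 0.8])]

# m[i][j] is the message from node i to neighbor j, initialized uniform.
m = {i: {j: np.ones(n_states) for j in nbrs} for i, nbrs in neighbors.items()}

for _ in range(10):  # iterate the message update to convergence
    new = {i: {} for i in neighbors}
    for i, nbrs in neighbors.items():
        for j in nbrs:
            # m_ij(x_j) = sum_{x_i} phi(x_i, y_i) psi(x_i, x_j)
            #             prod_{k in N(i), k != j} m_ki(x_i)
            prod = phi[i].copy()
            for k in nbrs:
                if k != j:
                    prod *= m[k][i]
            msg = psi.T @ prod           # sum out x_i
            new[i][j] = msg / msg.sum()  # normalize for numerical stability
    m = new

# b_i(x_i) is proportional to phi(x_i, y_i) prod_{j in N(i)} m_ji(x_i).
for i, nbrs in neighbors.items():
    b = phi[i].copy()
    for j in nbrs:
        b *= m[j][i]
    print(i, b / b.sum())
```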
How BP Works
[figure sequence: messages propagating between neighboring nodes over successive iterations]
Application: Sensor Probing
[figure: sensor map]
Application: Sensor Probing

- Define the neighbors.
- Estimate the potentials from history: $\phi_i(x_i, y_i) = P(y_i \mid x_i)$, $\psi_{ij}(x_i, x_j) = P(x_j \mid x_i)$.
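One way to estimate these conditional-probability potentials from historical readings, assuming discretized values and Laplace smoothing (the slides do not specify the estimator), is sketched below:

```python
import numpy as np

def estimate_psi(history, i, j, n_states, alpha=1.0):
    """psi_ij(x_i, x_j) = P(x_j | x_i), counted from co-occurring readings.

    history: (T, n_sensors) array of discretized readings.
    """
    counts = np.full((n_states, n_states), alpha)  # Laplace smoothing
    for xi, xj in zip(history[:, i], history[:, j]):
        counts[xi, xj] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def estimate_phi(true_vals, observed_vals, n_states, alpha=1.0):
    """phi_i(x_i, y_i) = P(y_i | x_i), a per-sensor noise model.

    Assumes a historical period where true and observed values are paired.
    """
    counts = np.full((n_states, n_states), alpha)
    for x, y in zip(true_vals, observed_vals):
        counts[x, y] += 1
    return counts / counts.sum(axis=1, keepdims=True)

history = np.array([[0, 0], [1, 1], [1, 2], [2, 2]])  # toy history
print(estimate_psi(history, 0, 1, n_states=3))
```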
Top-K Queries for Sensor Probing

Naive probing:
1: Init: compute expected sensor readings, and pick the top N;
2: Probe the selected sensors;
3: Pick the top K out of the probed.

BP-based probing:
1: Init: compute expected sensor readings, and pick the top M;
2: while beliefs have not converged do
3:   Probe the selected sensors;
4:   Propagate beliefs and update expectations;
5:   Pick the sensors with the top expectations;
6: end while
7: Pick the top K out of the probed.

E.g., #sensors = 167, K = 10, N = 40, M = 8.
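Below is a runnable skeleton of the BP-based probing loop. The probe model (exact reads) and the propagation step (neighbor averaging on a ring with probed values pinned) are crude stand-ins so the control flow runs end to end; a real system would propagate beliefs with the learned potentials, as in the BP sketch above:

```python
import numpy as np

rng = np.random.default_rng(1)
N_SENSORS, K, M = 167, 10, 8
true_vals = rng.uniform(0, 11, N_SENSORS)      # hidden sensor readings

def propagate(exp, probed, iters=50):
    """Stand-in for BP: smooth expectations, keeping evidence pinned."""
    exp = exp.copy()
    for _ in range(iters):
        exp = (np.roll(exp, 1) + exp + np.roll(exp, -1)) / 3.0
        for s, v in probed.items():
            exp[s] = v                          # probed values are evidence
    return exp

expectations = np.full(N_SENSORS, 5.5)          # prior expectations
probed, top_k = {}, []
while True:
    ranked = [s for s in np.argsort(-expectations) if s not in probed]
    probed.update({s: true_vals[s] for s in ranked[:M]})  # probe M sensors
    expectations = propagate(expectations, probed)
    new_top_k = sorted(probed, key=probed.get, reverse=True)[:K]
    if new_top_k == top_k:                      # answer has stabilized
        break
    top_k = new_top_k

print(f"probed {len(probed)}/{N_SENSORS} sensors; top-{K}: {top_k}")
```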
Top-K Queries for Sensor Probing
[figures: Naive vs. BP-based probing results]

Compared with the naive approach, the BP-based approach on average needs 8% less probing, and achieves 13.6% higher recall on raw values and 7.7% higher recall on discrete values.
Average Queries for Sensor Probing
[figure: average estimation error bars on discrete values (0-11), BP-based vs. Naive]

Compared with the naive approach, the BP-based approach on average needs 8% less probing. BP has a mean error of -0.75 (deviation 0.79); Naive has a mean error of -1.39 (deviation 1.35).
Application: Text Denoising

rule         mutation prob.   # errors   % corrected
x → k        100%             56         91%
f → d        30%              123        92%
f → z        28%              118        87%
th → tn      52%              220        96%
se → ue      18%              51         93%
se → le      25%              69         94%
se → ie      21%              58         95%
tio → tho    20%              35         100%
tio → txo    20%              35         100%
tio → two    31%              57         98%

Total words/errors: 3459/822. Overall accuracy: 94%.

Distortion rules and error correction.
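In the same spirit, denoising can be cast as MAP inference on a word chain: $\phi$ scores an observed token as a possible mutation of a candidate word, and $\psi$ scores word-to-word context. The tiny vocabulary, rules, and probabilities below are illustrative assumptions; the chain MAP is computed Viterbi-style, i.e. max-product BP on a chain:

```python
VOCAB = ["the", "sense", "of", "this", "sentence"]
RULES = {"se": ["ue", "le", "ie"], "th": ["tn"]}       # toy mutation rules

def phi(observed, candidate):
    """Noise-channel score: how likely `observed` is a mutation of `candidate`."""
    if observed == candidate:
        return 0.7
    for src, dsts in RULES.items():
        for dst in dsts:
            if candidate.replace(src, dst, 1) == observed:
                return 0.3 / len(dsts)
    return 1e-6

BIGRAM = {("the", "sense"): 0.5, ("sense", "of"): 0.6,
          ("of", "this"): 0.4, ("this", "sentence"): 0.5}

def psi(w1, w2):
    """Context score for consecutive words (toy bigram table)."""
    return BIGRAM.get((w1, w2), 0.01)

def denoise(tokens):
    """Viterbi (max-product BP on a chain) over candidate word sequences."""
    best = {w: phi(tokens[0], w) for w in VOCAB}
    back = []
    for tok in tokens[1:]:
        nxt, ptr = {}, {}
        for w in VOCAB:
            prev = max(VOCAB, key=lambda u: best[u] * psi(u, w))
            nxt[w] = best[prev] * psi(prev, w) * phi(tok, w)
            ptr[w] = prev
        back.append(ptr)
        best = nxt
    out = [max(best, key=best.get)]
    for ptr in reversed(back):
        out.append(ptr[out[-1]])
    return out[::-1]

print(denoise(["tne", "iense", "of", "tnis", "ientence"]))
# -> ['the', 'sense', 'of', 'this', 'sentence']
```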
Related Work

- Sensor probing:
  - BBQ: global multivariate Gaussian (A. Deshpande et al., VLDB 04)
- Text cleaning:
  - NLP (G. Salton et al., McGraw-Hill 83)
- Data cleaning:
  - Special-purpose data cleaning
  - Classification rule-based
  - Outlier detection
Conclusion

- A unified approach to data cleaning:
  - exploiting data dependency;
  - inferring missing values;
  - correcting noisy values.
- Future work:
  - extending to dynamic graph structures;
  - application: Web usage analysis.
Thank you!