Data Cleaning Using Belief Propagation

Fang Chu, Yizhou Wang, D. Stott Parker, Carlo Zaniolo
Department of Computer Science, University of California, Los Angeles


Outline

- Data Quality
- Markov Network
- Belief Propagation
- Applications

Data Quality

Example: incomplete sensor readings


Data Quality

Extreme solution 1: use the marginal probability.

$\hat{x}_1$ is the mean or mode of $\phi(x_1)$. This treats each missing value in isolation and ignores dependencies among sensors.

Data Quality

Extreme solution 2: use the joint probability.

$(\hat{x}_1, \hat{x}_2, \dots, \hat{x}_8)$ is the mean or mode of the joint posterior probability. This captures all dependencies, but exact joint inference grows exponentially with the number of variables.

Data Quality

Middle ground: exploit local dependencies, i.e., model dependencies only between neighboring variables. This leads to a pairwise Markov network.

Pairwise Markov Networks

- white circles: random variables $\{x_i\}$
- gray circles: external evidences (or observations) $\{y_i\}$
- $\phi(x_i, y_i)$: external potential
- $\psi(x_i, x_j)$: internal binding

Factorized joint probability:

$$P(\vec{x}, \vec{y}) = \frac{1}{Z} \prod_{(i,j)} \psi(x_i, x_j) \prod_i \phi(x_i, y_i)$$
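To make the factorization concrete, here is a minimal sketch in Python: a three-node chain of binary variables with made-up potential tables (the numbers are illustrative, not from the paper), where the partition function $Z$ is computed by brute force. This is feasible only for tiny networks, which is exactly why efficient inference is needed later.

```python
import itertools
import numpy as np

phi = np.array([[0.9, 0.1],   # phi[x, y]: external potential linking
                [0.2, 0.8]])  # hidden state x to observation y
psi = np.array([[0.7, 0.3],   # psi[xi, xj]: internal binding between
                [0.3, 0.7]])  # neighboring hidden states

edges = [(0, 1), (1, 2)]      # chain structure x1 - x2 - x3
y = [0, 1, 1]                 # fixed observations

def unnormalized(x):
    """Product of all edge and node potentials for assignment x."""
    p = 1.0
    for i, j in edges:
        p *= psi[x[i], x[j]]
    for i, xi in enumerate(x):
        p *= phi[xi, y[i]]
    return p

# Brute-force partition function Z (exponential in network size).
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))
print("P(x=(0,1,1), y) =", unnormalized((0, 1, 1)) / Z)
```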


Solving a Markov Network

Solving a Markov network involves two phases:

Learning phase:
- Determine the network structure (variables and their neighbors);
- Learn the potential functions: the $\phi$'s and $\psi$'s.

Inference phase:
- Infer the mean or maximum a posteriori (MAP) values of the unknown $x_i$'s, based on the $\phi$'s, $\psi$'s, and the known $y_j$'s.
- Efficient inference: Belief Propagation.

Belief Propagation (BP)

Basics of Belief Propagation: iterative "message passing" between neighbors (J. Yedidia et al., NIPS 2000).

Message update:

$$m_{ij}^{t+1}(x_j) = \sum_{x_i} \phi(x_i, y_i)\, \psi(x_i, x_j) \prod_{k \in N(i),\, k \neq j} m_{ki}^{t}(x_i)$$

Sum-product belief:

$$b_i(x_i) \propto \phi(x_i, y_i) \prod_{j \in N(i)} m_{ji}(x_i)$$
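A sketch of these updates on the same toy three-node chain as before; the potentials are again illustrative placeholders, and the loop runs a fixed number of synchronous sweeps rather than testing convergence. On a chain (a tree), these beliefs match the exact marginals from the brute-force computation above.

```python
import numpy as np

# Same toy chain as the previous snippet: 3 binary nodes, illustrative tables.
phi = np.array([[0.9, 0.1],
                [0.2, 0.8]])   # phi[x, y]
psi = np.array([[0.7, 0.3],
                [0.3, 0.7]])   # psi[xi, xj]
y = [0, 1, 1]                  # observations
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# m[(i, j)] is the message from node i to node j, initialized uniform.
m = {(i, j): np.ones(2) / 2 for i in neighbors for j in neighbors[i]}

for _ in range(10):            # a few synchronous sweeps suffice on a tree
    new = {}
    for (i, j) in m:
        # m_ij(x_j) = sum_xi phi(xi,yi) psi(xi,xj) prod_{k in N(i), k != j} m_ki(xi)
        incoming = np.ones(2)
        for k in neighbors[i]:
            if k != j:
                incoming *= m[(k, i)]
        msg = (phi[:, y[i]] * incoming) @ psi   # sums over x_i
        new[(i, j)] = msg / msg.sum()           # normalize for stability
    m = new

# b_i(x_i) proportional to phi(x_i, y_i) * prod_{j in N(i)} m_ji(x_i)
for i in neighbors:
    b = phi[:, y[i]].copy()
    for j in neighbors[i]:
        b *= m[(j, i)]
    print(f"belief at node {i}:", b / b.sum())
```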


How BP Works

(figure sequence: step-by-step illustration of messages propagating between neighboring nodes over successive iterations)

Application: Sensor Probing

(figure: sensor map)

Application: Sensor Probing

define the neighbors

estimate potentials based on history: φi (xi , yi ) = P(yi |xi ), ψij (xi , xj ) = P(xj |xi ).

18 / 30
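As a sketch of this learning step, the potentials can be estimated as empirical conditional frequencies over historical readings. The history arrays, the two-sensor setup, and the Laplace smoothing below are all assumptions made for illustration, not details from the paper.

```python
import numpy as np

S = 12                                    # discrete reading levels, e.g. 0-11
# Hypothetical history: true readings for 2 neighboring sensors, plus
# noisy observed readings derived from them.
history_x = np.random.randint(0, S, size=(1000, 2))
history_y = np.clip(history_x + np.random.randint(-1, 2, history_x.shape),
                    0, S - 1)

def conditional(a, b, states):
    """Empirical P(b | a) as a states-x-states table with +1 smoothing."""
    counts = np.ones((states, states))    # smoothing avoids zero potentials
    for ai, bi in zip(a, b):
        counts[ai, bi] += 1
    return counts / counts.sum(axis=1, keepdims=True)

phi_0 = conditional(history_x[:, 0], history_y[:, 0], S)    # P(y0 | x0)
psi_01 = conditional(history_x[:, 0], history_x[:, 1], S)   # P(x1 | x0)
print(phi_0.shape, psi_01.shape)
```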


Top-K Queries for Sensor Probing

Naive probing:
1: Init: compute expected sensor readings, and pick the top N;
2: Probe the selected sensors;
3: Pick the top K out of those probed.

BP-based probing (sketched in code below):
1: Init: compute expected sensor readings, and pick the top M;
2: while beliefs have not converged do
3:   Probe the selected sensors;
4:   Propagate beliefs and update expectations;
5:   Pick the sensors with top expectations;
6: end while
7: Pick the top K out of those probed.

E.g., #sensors = 167, K = 10, N = 40, M = 8.
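A minimal sketch of the BP-based probing loop, under stated assumptions: `probe` and `run_bp` are hypothetical stubs standing in for an actual sensor read and a belief-propagation pass over the sensor network, and the convergence test is a simple change threshold rather than the paper's criterion.

```python
import numpy as np

def top_k_bp_probing(expectations, probe, run_bp, K=10, M=8,
                     tol=1e-3, max_rounds=20):
    """expectations: array of initial expected readings, one per sensor."""
    probed = {}                                 # sensor id -> observed value
    for _ in range(max_rounds):
        # Pick the M not-yet-probed sensors with the highest expectations.
        candidates = [i for i in np.argsort(expectations)[::-1]
                      if i not in probed][:M]
        if not candidates:
            break
        for i in candidates:
            probed[i] = probe(i)
        # Propagate beliefs through the network and update expectations.
        new_exp = run_bp(expectations, probed)
        converged = np.max(np.abs(new_exp - expectations)) < tol
        expectations = new_exp
        if converged:
            break
    # Answer the top-K query from the sensors actually probed.
    return sorted(probed, key=probed.get, reverse=True)[:K]
```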


Top-K Queries for Sensor Probing

(figures: Naive vs. BP-based probing results)

On average, the BP-based approach compared with the naive approach: 8% less probing; 13.6% higher recall on raw values; and 7.7% higher recall on discrete values.


Average Queries for Sensor Probing

(figure: average estimation error bars on discrete values (0-11), BP-based vs. Naive)

On average, the BP-based approach uses 8% less probing than the naive approach. BP has a mean error of −0.75 with deviation 0.79; Naive has a mean error of −1.39 with deviation 1.35.


Application: Text Denoising

rule        mutation prob.   # errors   % corrected
x → k       100%             56         91%
f → d       30%              123        92%
f → z       28%              118        87%
th → tn     52%              220        96%
se → ue     18%              51         93%
se → le     25%              69         94%
se → ie     21%              58         95%
tio → tho   20%              35         100%
tio → txo   20%              35         100%
tio → two   31%              57         98%

Total words/errors: 3459/822. Overall accuracy: 94%.

Distortion rules and error correction.
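One way to cast denoising in this framework is MAP inference on a chain of characters: $\phi$ scores each observed character against candidate true characters (a channel model built from distortion rules like those above), and $\psi$ acts as a bigram language model. The sketch below uses max-product (Viterbi) decoding with toy tables; the paper's actual models and rule set are not reproduced here.

```python
import numpy as np

A = 26                                         # alphabet size, a..z
rng = np.random.default_rng(0)
psi = rng.random((A, A)) + 0.1                 # toy bigram potentials
psi /= psi.sum(axis=1, keepdims=True)          # rows as P(next | prev)

def phi(obs):
    """Toy channel model: the observed character is usually the true one."""
    table = np.full(A, 0.01)
    table[obs] = 0.8
    if obs == ord('k') - ord('a'):             # echo the x -> k rule above:
        table[ord('x') - ord('a')] = 0.15      # observed 'k' may be a true 'x'
    return table

def map_decode(observed):
    """Max-product (Viterbi) decoding of the most likely true string."""
    T = len(observed)
    score = np.log(phi(observed[0]))           # log-scores over first char
    back = np.zeros((T, A), dtype=int)         # backpointers
    for t in range(1, T):
        cand = score[:, None] + np.log(psi)    # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + np.log(phi(observed[t]))
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # follow backpointers
        path.append(back[t][path[-1]])
    return path[::-1]

print(map_decode([10, 0, 19]))                 # observed "kat" as indices
```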


Related Work

Sensor probing:
- BBQ: global multivariate Gaussian model (A. Deshpande et al., VLDB 2004)

Text cleaning:
- NLP (G. Salton et al., McGraw-Hill, 1983)

Data cleaning:
- Special-purpose data cleaning
- Classification rule-based approaches
- Outlier detection

Conclusion

A unified approach to data cleaning:
- Exploits data dependencies
- Infers missing values
- Corrects noisy values

Future work:
- Extending to dynamic graph structures
- Application: Web usage analysis

Thank you!