BIDE: Efficient Mining of Frequent Closed Sequences
Jianyong Wang and Jiawei Han
University of Illinois at Urbana-Champaign
To appear in ICDE 2004
Presented by: Yi-Hung Wu
Date: 2004/3/1

Closed Frequent Sequence Mining

Where will data mining research go?
[Figure. Recoverable labels: Data; Knowledge; Action; Profit; Invisible (embedded tool); Frequent Itemsets, Association Rules, Sequential Patterns, Clusters, Outliers, …; Text Classification, Web Usage Prediction, Feature Selection, Anomaly Detection, …; Microarray Classification, “In Vivo” Spam Filtering, …; Biomedical, Financial, Geoscience, Telecom, …; Constraint-based, Applied, and Action-oriented Data Mining; Text Mining, Web Mining, Multimedia Mining, Stream Mining]

Which patterns are interesting (applicable)?
[Figure. Recoverable labels: Relation (Itemset, Sequence, Tree, Graph); Interestingness; Circumstance; Support; Confidence; Optimal; All Frequent; Only Confident; Maximal Frequent; Closed Frequent; Correlated; Top-k Frequent Closed; Non-uniform Support Threshold; Approximate Constraint; Aggregate Constraint; Quantity; Profitable]

What is “Closed Frequent”?
• Take itemsets as an example…
  – F ⊇ Closed F ⊇ Max F
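The containment F ⊇ Closed F ⊇ Max F can be checked on a small example. The sketch below is a brute-force illustration, not an efficient miner; the transaction database `db`, the threshold `min_sup`, and all function names are assumptions made up for this example.

```python
from itertools import combinations


def frequent_itemsets(db, min_sup):
    """Brute force: map every frequent itemset to its support count."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(1 for t in db if set(cand) <= t)
            if sup >= min_sup:
                result[frozenset(cand)] = sup
    return result


def closed_itemsets(freq):
    """Frequent itemsets with no proper superset of equal support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}


def maximal_itemsets(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {x: s for x, s in freq.items() if not any(x < y for y in freq)}


# Hypothetical toy database of four transactions.
db = [{"A", "C", "T"}, {"A", "C", "T"}, {"A", "C"}, {"C", "T"}]
F = frequent_itemsets(db, min_sup=2)
closed = closed_itemsets(F)
maximal = maximal_itemsets(F)

# Every maximal itemset is closed, and every closed itemset is frequent.
print(len(F), len(closed), len(maximal))  # prints: 7 4 1
```

On this toy data the seven frequent itemsets shrink to four closed ones (C, AC, CT, ACT) and one maximal one (ACT). The closed set still carries every support value, while the maximal set does not.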

How to mine closed frequent itemsets?
• CHARM [Zaki: sdm02, kdd03]
  – IT-tree
  – Pruning!
    1. t(X)=t(Y) → c(X)=c(Y)=c(X∪Y)
    2. t(X)⊂t(Y) → c(X)≠c(Y), but c(X)=c(X∪Y)
    3. t(X)⊃t(Y) → c(X)≠c(Y), but c(Y)=c(X∪Y)
    4. Otherwise → c(X)≠c(Y)≠c(X∪Y)
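The four properties can be observed directly by computing tidsets and closures. In this minimal sketch, t(X) is taken to be the set of ids of the transactions containing X, and c(X) the set of items common to those transactions. The toy database and the helper names (`tidset`, `closure`, `charm_case`) are hypothetical; the code only illustrates the case analysis, not CHARM's IT-tree search.

```python
def tidset(db, itemset):
    """t(X): ids of the transactions that contain every item of X."""
    return frozenset(tid for tid, items in db.items() if itemset <= items)


def closure(db, itemset):
    """c(X): items shared by all transactions in t(X).

    Assumes X occurs in at least one transaction.
    """
    tids = tidset(db, itemset)
    return frozenset.intersection(*(frozenset(db[t]) for t in tids))


def charm_case(db, x, y):
    """Return which of the four properties (1-4) applies to the pair X, Y."""
    tx, ty = tidset(db, x), tidset(db, y)
    if tx == ty:
        return 1
    if tx < ty:
        return 2
    if tx > ty:
        return 3
    return 4


# Hypothetical toy database: transaction id -> set of items.
db = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}
A, D, T, W = (frozenset(i) for i in "ADTW")

print(charm_case(db, A | T, T | W))  # prints: 1
print(charm_case(db, A, W))          # prints: 2
print(charm_case(db, W, A))          # prints: 3
print(charm_case(db, A, D))          # prints: 4
```

In case 2, for example, c(A) = c(A∪W) = {A, C, W} while c(W) = {C, W}, so X can be replaced by X∪Y without losing any closed itemset.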

What is a Closed Frequent Sequence?
• CloSpan [Yan&Han: sdm03]
  – Lexicographic sequence tree
  – Pruning!
    • Common Prefix
    • Partial Order
    • Early Termination by Equivalence
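As a rough illustration of the question in the slide title: a frequent sequence is closed when no proper super-sequence has the same support. The brute-force sketch below assumes sequences of single items written as strings. The toy database, the threshold, and the function names are assumptions for this example, and the code does not reproduce CloSpan's pruning.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(db, pattern):
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in db)


def frequent_sequences(db, min_sup):
    """Grow patterns one item at a time; every prefix of a frequent
    sequence is itself frequent, so this enumerates all of them."""
    items = sorted({x for s in db for x in s})
    result, frontier = {}, [""]
    while frontier:
        grown = []
        for p in frontier:
            for x in items:
                q = p + x
                sup = support(db, q)
                if sup >= min_sup:
                    result[q] = sup
                    grown.append(q)
        frontier = grown
    return result


def closed_sequences(freq):
    """Keep patterns that have no longer super-sequence of equal support."""
    return {p: s for p, s in freq.items()
            if not any(len(q) > len(p) and s == sq and is_subsequence(p, q)
                       for q, sq in freq.items())}


# Hypothetical toy database of four sequences.
db = ["CAABC", "ABCB", "CABC", "ABBCA"]
freq = frequent_sequences(db, min_sup=2)
closed = closed_sequences(freq)
print(len(freq), sorted(closed.items()))
# prints: 17 [('AA', 2), ('ABB', 2), ('ABC', 4), ('CA', 3), ('CABC', 2), ('CB', 3)]
```

Here 17 frequent sequences reduce to 6 closed ones. AB, for instance, is dropped because ABC has the same support of 4.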

How to mine closed frequent sequences?
• Stage 1: Generate candidate sequences
  – PrefixSpan + Pruning! ⇒ Prefix sequence lattice
• Stage 2: Eliminate non-closed sequences
  – Hashing: size, s-id sum
    • Support equality
    • Subsumption check
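Stage 2 can be pictured as a hash-then-compare pass over the candidates. In the sketch below the hash key is the pair (support, sum of supporting sequence ids); this is only one reading of the slide's "size, s-id sum", not a confirmed description of CloSpan. The database, the candidate list, and the function names are likewise hypothetical.

```python
from collections import defaultdict


def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def supporting_ids(db, pattern):
    """1-based ids of the sequences that contain `pattern`."""
    return frozenset(i for i, s in enumerate(db, start=1)
                     if is_subsequence(pattern, s))


def eliminate_non_closed(db, candidates):
    """Bucket candidates by (support, id sum), then run the subsumption
    check only inside each bucket."""
    buckets = defaultdict(list)
    for p in candidates:
        ids = supporting_ids(db, p)
        buckets[(len(ids), sum(ids))].append(p)
    closed = []
    for group in buckets.values():
        for p in group:
            # Same bucket means same support; a longer super-sequence in
            # the bucket therefore proves that p is not closed.
            if not any(len(q) > len(p) and is_subsequence(p, q) for q in group):
                closed.append(p)
    return sorted(closed)


# Hypothetical toy database and its frequent sequences at min_sup = 2.
db = ["CAABC", "ABCB", "CABC", "ABBCA"]
candidates = ["A", "B", "C", "AA", "AB", "AC", "BB", "BC", "CA", "CB", "CC",
              "ABB", "ABC", "CAB", "CAC", "CBC", "CABC"]
print(eliminate_non_closed(db, candidates))
# prints: ['AA', 'ABB', 'ABC', 'CA', 'CABC', 'CB']
```

The bucketing is safe because a pattern and a super-sequence with equal support are contained in exactly the same sequences, so they always share a key. The cost that remains is that all candidates must be kept until this pass runs.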

How well does CloSpan perform?
• Dataset: D10C10T2.5N10S6I2.5

Can we mine closed frequent sequences without candidate maintenance?
• BIDE
  – BI-Directional Extension
    • Forward extension events
    • Backward extension events
  – Closure check
    • No forward extension (FE)
    • No backward extension (BE)
  – Pruning!
    • BackScan
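The forward half of the closure check can be sketched as follows, assuming single-item events written as strings: project each sequence on the prefix, then look for items that occur in every projected suffix. The toy database and the function names are assumptions for illustration only.

```python
from collections import Counter


def suffix_after_first_instance(prefix, sequence):
    """Part of `sequence` after the first instance of `prefix`,
    or None if the sequence does not contain the prefix."""
    pos = 0
    for item in prefix:
        pos = sequence.find(item, pos)
        if pos < 0:
            return None
        pos += 1
    return sequence[pos:]


def forward_extension_items(db, prefix):
    """Items whose support in the projected database equals the
    support of the prefix itself."""
    suffixes = [s for s in (suffix_after_first_instance(prefix, seq)
                            for seq in db) if s is not None]
    # Count each item at most once per projected suffix.
    counts = Counter(item for suf in suffixes for item in set(suf))
    return sorted(item for item, c in counts.items() if c == len(suffixes))


# Hypothetical toy database of four sequences.
db = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(forward_extension_items(db, "AB"))   # prints: ['C']
print(forward_extension_items(db, "ABC"))  # prints: []
```

AB can be extended forward by C without losing support, so AB is not closed. ABC has no forward extension, so whether it is closed depends on the backward check alone.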

Where to find forward/backward extensions?
• FE = {locally frequent items with full support}
• For prefix ABC, given C1A1A2BC2DA3C3E
  – Last instance = C1A1A2BC2DA3C3
  – LLi: the i-th last-in-last appearance
    • LL1 = A2, LL2 = B, LL3 = C3
  – MPi: the i-th maximum period
    • MP1 = C1A1, MP2 = A2, MP3 = C2DA3
• Maximum periods of the sequences S1, …, Sn that contain the prefix:
    S1: MP11, MP12, MP13
    S2: MP21, MP22, MP23
    …
    Sn: MPn1, MPn2, MPn3
• BE = {items appearing in MPi of every sequence, for some i}
  – Scan backward each of MPi, ∀i ⇒ ScanSkip
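The slide's example can be reproduced with two greedy scans: a left-to-right scan locates the first instance of each prefix of the prefix, and a right-to-left scan locates the last-in-last appearances. This is a minimal sketch assuming single-item events written as strings and a non-empty prefix; the subscripts of the slide become plain positions, and the toy database and function names are hypothetical.

```python
def first_instance_positions(prefix, seq):
    """Greedy left-to-right match: position of each prefix item."""
    out, pos = [], 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos < 0:
            return None
        out.append(pos)
        pos += 1
    return out


def last_in_last_positions(prefix, seq):
    """Greedy right-to-left match: positions of LL_1 .. LL_n."""
    out, end = [], len(seq)
    for item in reversed(prefix):
        end = seq.rfind(item, 0, end)
        if end < 0:
            return None
        out.append(end)
    return out[::-1]


def maximum_periods(prefix, seq):
    """MP_1 is the part before LL_1; MP_i (i > 1) lies between the end of
    the first instance of the first i-1 prefix items and LL_i."""
    first = first_instance_positions(prefix, seq)
    if first is None:
        return None
    ll = last_in_last_positions(prefix, seq)
    starts = [0] + [p + 1 for p in first[:-1]]
    return [seq[s:e] for s, e in zip(starts, ll)]


def backward_extension_items(db, prefix):
    """Pairs (i, item): item occurs in the i-th maximum period of every
    sequence that contains the prefix."""
    periods = [mp for mp in (maximum_periods(prefix, s) for s in db)
               if mp is not None]
    found = set()
    if periods:
        for i in range(len(prefix)):
            common = set.intersection(*(set(mp[i]) for mp in periods))
            found |= {(i + 1, item) for item in common}
    return sorted(found)


# The slide's sequence C1 A1 A2 B C2 D A3 C3 E, without subscripts.
print(maximum_periods("ABC", "CAABCDACE"))  # prints: ['CA', 'A', 'CDA']

# Hypothetical toy database of four sequences.
db = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(backward_extension_items(db, "BC"))   # prints: [(1, 'A')]
print(backward_extension_items(db, "ABC"))  # prints: []
```

The three periods match MP1 = C1A1, MP2 = A2, and MP3 = C2DA3 from the slide. On the toy database, BC can be extended backward by A in its first period, so BC is not closed, while ABC has no backward extension.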

How does BIDE improve the mining efficiency?
• BackScan: prefix ABC, sequence C1A1A2BC2DA3C3E
  – LFi: the i-th last-in-first appearance
    • LF1 = A2, LF2 = B, LF3 = C2
  – SMPi: the i-th semi-maximum period
    • SMP1 = C1A1, SMP2 = A2, SMP3 = ∅
• If ∃e ∃i such that e appears in SMPi of every sequence
  – Stop projection!
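The BackScan test differs from the backward-extension check only in that it works inside the first instance of the prefix instead of the last one. The following sketch again assumes single-item events written as strings and a non-empty prefix; the toy database and the names `semi_maximum_periods` and `backscan_prunable` are made up for this example.

```python
def first_instance_positions(prefix, seq):
    """Greedy left-to-right match: position of each prefix item."""
    out, pos = [], 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos < 0:
            return None
        out.append(pos)
        pos += 1
    return out


def semi_maximum_periods(prefix, seq):
    """SMP_i ends at LF_i, the last-in-first appearance of the i-th item,
    and starts where the maximum period MP_i would start."""
    first = first_instance_positions(prefix, seq)
    if first is None:
        return None
    # LF_n .. LF_1: scan right to left inside the first instance only.
    lf, end = [], first[-1] + 1
    for item in reversed(prefix):
        end = seq.rfind(item, 0, end)
        lf.append(end)
    lf.reverse()
    starts = [0] + [p + 1 for p in first[:-1]]
    return [seq[s:e] for s, e in zip(starts, lf)]


def backscan_prunable(db, prefix):
    """True if some item occurs in the i-th semi-maximum period of every
    sequence containing the prefix, for some i."""
    periods = [smp for smp in (semi_maximum_periods(prefix, s) for s in db)
               if smp is not None]
    if not periods:
        return False
    return any(set.intersection(*(set(smp[i]) for smp in periods))
               for i in range(len(prefix)))


# The slide's sequence C1 A1 A2 B C2 D A3 C3 E, without subscripts.
print(semi_maximum_periods("ABC", "CAABCDACE"))  # prints: ['CA', 'A', '']

# Hypothetical toy database of four sequences.
db = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(backscan_prunable(db, "B"))  # prints: True
print(backscan_prunable(db, "A"))  # prints: False
```

The periods match SMP1 = C1A1, SMP2 = A2, and SMP3 = ∅ from the slide. On the toy database, A precedes the first B in every sequence, so the search under prefix B can stop: the same patterns are reached with the same support under prefix AB.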

Does BIDE perform much better?
• BIDE and CloSpan significantly outperform PrefixSpan and SPADE when the support threshold is low
• BIDE consumes much less memory and can be an order of magnitude faster than CloSpan
• BIDE scales linearly with data size
• The BackScan and ScanSkip techniques are very effective in enhancing performance

Concluding Remarks
• Closed Frequent has the same expressive power as All Frequent, but provides more compact results and likely better efficiency.
• Integrated optimization techniques for database projection, search space pruning, and pattern-closure checking are required.
• Move from the candidate-maintenance-and-test paradigm to a new paradigm without candidate maintenance.

Any Questions? My Questions…
• CloSpan
  – vs.
  – D=D=D
  – D=D=D
• BIDE
  – How to efficiently compute or maintain MPi/SMPi?
  – Can BIDE be easily adapted to sequences of itemsets?
• What is the difference between Closed Frequent Sequences and Non-trivial Repeating Patterns?