BIDE: Efficient Mining of Frequent Closed Sequences
Jianyong Wang and Jiawei Han
University of Illinois at Urbana-Champaign
To appear in ICDE 2004
Presented by: Yi-Hung Wu
Date: 2004/3/1
Closed Frequent Sequence Mining
Where will data mining research go?
[Diagram; recoverable labels:]
• Data, Knowledge, Action, Profit
• Data Mining; Applied Data Mining
• Constraint-based; Action-oriented; Invisible (embedded tool)
• Text Mining, Web Mining, Multimedia Mining, Stream Mining
• Frequent Itemsets, Association Rules, Sequential Patterns, Clusters, Outliers…
• Text Classification, Web Usage Prediction, Feature Selection, Anomaly Detection, …
• Microarray Classification, “In Vivo” Spam Filtering, …
• Biomedical, Financial, Geoscience, Telecom, …
Which patterns are interesting (applicable)?
[Diagram; recoverable labels:]
• Relation, Interestingness, Circumstance
• Itemset, Sequence, Tree, Graph
• Support, Confidence
• All Frequent, Maximal Frequent, Closed Frequent, Top-k Frequent Closed
• Only Confident, Optimal, Correlated
• Non-uniform Support Threshold, Approximate Constraint, Aggregate Constraint
• Quantity, Profitable
What is “Closed Frequent”?
• Take itemset as an example…
  – F ⊇ Closed F ⊇ Max F
  – AC→T
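The containment F ⊇ Closed F ⊇ Max F can be checked on a small example. The sketch below is a brute-force illustration, not an efficient miner; the toy transactions and the function names are assumptions made here and do not come from the slides.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Return {itemset: support} for every itemset with support >= min_sup (brute force)."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_sup:
                result[candidate] = support
    return result

def closed_itemsets(freq):
    """Frequent itemsets that have no proper superset with the same support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}

def maximal_itemsets(freq):
    """Frequent itemsets that have no frequent proper superset."""
    return {x: s for x, s in freq.items() if not any(x < y for y in freq)}

# Hypothetical toy database (an assumption for illustration only).
transactions = [frozenset("ACT"), frozenset("ACT"), frozenset("AC"), frozenset("C")]
F = frequent_itemsets(transactions, min_sup=2)
closed = closed_itemsets(F)
maximal = maximal_itemsets(F)

print(len(F), len(closed), len(maximal))
```

With this toy data, every subset of {A, C, T} is frequent, only C, AC and ACT are closed, and only ACT is maximal. The closed sets still determine every support (for instance, the support of A equals that of its closure AC), whereas the maximal set alone does not.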
How to mine closed frequent itemsets?
• CHARM [Zaki: sdm02, kdd03]
  – IT-tree
  – Pruning!
    1. t(X)=t(Y) → c(X)=c(Y)=c(X∪Y)
    2. t(X)⊂t(Y) → c(X)≠c(Y), but c(X)=c(X∪Y)
    3. t(X)⊃t(Y) → c(X)≠c(Y), but c(Y)=c(X∪Y)
    4. Otherwise → c(X)≠c(Y)≠c(X∪Y)
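The four pruning properties relate the tidset t(X) of an itemset to its closure c(X). The following sketch only checks the properties on a toy database; it does not implement the IT-tree search itself. The database and the helper names (`tidset`, `closure`, `charm_case`, `check_property`) are assumptions made for this illustration.

```python
def tidset(itemset, db):
    """t(X): ids of the transactions that contain every item of X."""
    return frozenset(tid for tid, items in db.items() if itemset <= items)

def closure(itemset, db):
    """c(X): items shared by all transactions in t(X). Assumes t(X) is non-empty."""
    return frozenset.intersection(*(db[tid] for tid in tidset(itemset, db)))

def charm_case(x, y, db):
    """Return which of the four properties applies to the pair (X, Y)."""
    tx, ty = tidset(x, db), tidset(y, db)
    if tx == ty:
        return 1
    if tx < ty:
        return 2
    if tx > ty:
        return 3
    return 4

def check_property(x, y, db):
    """Return (case, holds): the applicable case and whether its conclusion holds."""
    case = charm_case(x, y, db)
    cx, cy, cxy = closure(x, db), closure(y, db), closure(x | y, db)
    if case == 1:
        holds = cx == cy == cxy
    elif case == 2:
        holds = cx != cy and cx == cxy
    elif case == 3:
        holds = cx != cy and cy == cxy
    else:
        holds = cx != cy and cxy not in (cx, cy)
    return case, holds

# Hypothetical toy database: transaction id -> set of items.
db = {1: frozenset("ACTW"), 2: frozenset("CDW"), 3: frozenset("ACTW"),
      4: frozenset("ACDW"), 5: frozenset("ACDTW"), 6: frozenset("CDT")}

results = [
    check_property(frozenset("AT"), frozenset("TW"), db),  # equal tidsets
    check_property(frozenset("A"), frozenset("W"), db),    # t(A) inside t(W)
    check_property(frozenset("W"), frozenset("A"), db),    # t(W) contains t(A)
    check_property(frozenset("A"), frozenset("D"), db),    # incomparable tidsets
]
print(results)
```

In cases 1 to 3, one of the two itemsets has the same closure as X∪Y, so it can be replaced by the union instead of being explored as a separate branch; only case 4 yields a genuinely new closure.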
What is a Closed Frequent Sequence?
• CloSpan [Yan&Han: sdm03]
  – Lexicographic sequence tree
  – Pruning!
    • Common Prefix
    • Partial Order
    • Early Termination by Equivalence
How to mine closed frequent sequences?
• Stage 1: Generate candidate sequences
  – PrefixSpan + Pruning! ⇒ Prefix sequence lattice
• Stage 2: Eliminate non-closed sequences
  – Hashing: size, s-id sum
    • Support equality
    • Subsumption check
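Stage 2 can be sketched as a post-processing pass over the candidates. The version below is a simplified illustration under stated assumptions: sequences hold single items and are encoded as strings, candidates are bucketed only by the sum of the ids of their supporting sequences, and closedness is judged relative to the given candidate list. The toy database, the hand-picked candidates and all function names are hypothetical, and this is not claimed to be CloSpan's actual data structure.

```python
from collections import defaultdict

def is_subsequence(small, big):
    """True if `small` can be obtained from `big` by deleting items."""
    it = iter(big)
    return all(item in it for item in small)

def supporting_ids(pattern, db):
    """Ids of the database sequences that contain `pattern`."""
    return [sid for sid, seq in db.items() if is_subsequence(pattern, seq)]

def eliminate_non_closed(candidates, db):
    """Keep candidates not absorbed by a longer candidate with equal support."""
    # A pattern and a super-pattern with equal support are supported by the
    # same sequences, so they always share the same s-id sum (hash key).
    buckets = defaultdict(list)
    for pat in candidates:
        sids = supporting_ids(pat, db)
        buckets[sum(sids)].append((pat, len(sids)))

    closed = []
    for group in buckets.values():
        for pat, sup in group:
            absorbed = any(
                sup == other_sup              # support equality
                and len(other) > len(pat)
                and is_subsequence(pat, other)  # subsumption check
                for other, other_sup in group
            )
            if not absorbed:
                closed.append(pat)
    return closed

# Hypothetical toy database and a few hand-picked frequent candidates.
db = {1: "CAABC", 2: "ABCB", 3: "CABC", 4: "ABBCA"}
candidates = ["A", "AB", "ABC", "CA", "CAB", "CABC"]
closed = eliminate_non_closed(candidates, db)
print(sorted(closed))
```

The hash key only narrows the search: two unrelated patterns may collide on the same s-id sum, which is why the support equality and subsumption checks are still applied inside each bucket.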
How well does CloSpan perform?
• D10C10T2.5N10S6I2.5
Can we mine closed frequent sequences without candidate maintenance?
• BIDE
  – BI-Directional Extension
    • Forward extension events
    • Backward extension events
  – Closure check
    • No FE
    • No BE
  – Pruning!
    • BackScan
Where to find forward/backward extensions?
• FE = {locally frequent items with full support, i.e., support equal to that of the prefix}
• For prefix ABC, given C1A1A2BC2DA3C3E
  – Last instance = C1A1A2BC2DA3C3
  – LLi: the i-th last-in-last appearance
    • LL1 = A2, LL2 = B, LL3 = C3
  – MPi: the i-th maximum period
    • MP1 = C1A1, MP2 = A2, MP3 = C2DA3
• BE = {items that, for some i, appear in the i-th maximum period MPi of every sequence containing the prefix}
  – Scan each MPi backward, ∀i ⇒ ScanSkip
[Figure: S1: MP11, MP12, MP13; S2: MP21, MP22, MP23; …; Sn: MPn1, MPn2, MPn3]
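The maximum periods of the example can be reproduced with a short sketch. It assumes sequences of single items encoded as strings, and every sequence passed to `maximum_periods` must contain the prefix. The function names and the four-sequence toy database are assumptions made for this illustration; the ScanSkip optimization is not modelled.

```python
def contains(prefix, seq):
    """True if `prefix` is a subsequence of `seq`."""
    it = iter(seq)
    return all(item in it for item in prefix)

def first_instance_ends(prefix, seq):
    """Indices where the leftmost match of each prefix item ends."""
    ends, pos = [], 0
    for item in prefix:
        pos = seq.index(item, pos)
        ends.append(pos)
        pos += 1
    return ends

def last_in_last(prefix, seq):
    """Indices LL_1..LL_n: the rightmost match, found from right to left."""
    positions, end = [], len(seq)
    for item in reversed(prefix):
        end = seq.rindex(item, 0, end)
        positions.append(end)
    return positions[::-1]

def maximum_periods(prefix, seq):
    """MP_i: the piece between the end of the first instance of
    e_1..e_(i-1) and LL_i (for i = 1, the piece before LL_1)."""
    firsts = first_instance_ends(prefix, seq)
    lasts = last_in_last(prefix, seq)
    starts = [0] + [f + 1 for f in firsts[:-1]]
    return [seq[s:l] for s, l in zip(starts, lasts)]

def backward_extension_items(prefix, sequences):
    """Pairs (i, item): item occurs in the i-th maximum period (0-based)
    of every sequence that contains the prefix."""
    periods = [maximum_periods(prefix, s) for s in sequences if contains(prefix, s)]
    found = set()
    for i in range(len(prefix)):
        common = set.intersection(*(set(p[i]) for p in periods))
        found |= {(i, item) for item in common}
    return found

# The slide's example, with the occurrence subscripts dropped.
print(maximum_periods("ABC", "CAABCDACE"))

# Hypothetical toy database.
db = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(backward_extension_items("B", db))
print(backward_extension_items("CB", db))
```

In the toy database, A appears before the last B of every sequence, so prefix B has a backward extension and is not closed, while prefix CB has none.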
How does BIDE improve the mining efficiency?
• BackScan: prefix ABC, given C1A1A2BC2DA3C3E
  – LFi: the i-th last-in-first appearance
    • LF1 = A2, LF2 = B, LF3 = C2
  – SMPi: the i-th semi-maximum period
    • SMP1 = C1A1, SMP2 = A2, SMP3 = ∅
• If ∃e ∃i such that e appears in the i-th semi-maximum period SMPi of every sequence containing the prefix
  – Stop projection!
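The BackScan test can be sketched in the same way as the maximum-period computation, except that the right-to-left scan stays inside the first instance of the prefix. As before, sequences of single items encoded as strings are assumed, and the function names and toy database are hypothetical.

```python
def contains(prefix, seq):
    """True if `prefix` is a subsequence of `seq`."""
    it = iter(seq)
    return all(item in it for item in prefix)

def first_instance_ends(prefix, seq):
    """Indices where the leftmost match of each prefix item ends."""
    ends, pos = [], 0
    for item in prefix:
        pos = seq.index(item, pos)
        ends.append(pos)
        pos += 1
    return ends

def semi_maximum_periods(prefix, seq):
    """SMP_i: the piece between the end of the first instance of
    e_1..e_(i-1) and LF_i, the i-th last-in-first appearance."""
    firsts = first_instance_ends(prefix, seq)
    # Scan right to left, but only inside the first instance of the prefix.
    lf, end = [], firsts[-1] + 1
    for item in reversed(prefix):
        end = seq.rindex(item, 0, end)
        lf.append(end)
    lf.reverse()
    starts = [0] + [f + 1 for f in firsts[:-1]]
    return [seq[s:l] for s, l in zip(starts, lf)]

def backscan_prunes(prefix, sequences):
    """True if some item occurs in the i-th semi-maximum period of every
    sequence containing the prefix, for some i (stop projection)."""
    periods = [semi_maximum_periods(prefix, s)
               for s in sequences if contains(prefix, s)]
    return any(
        set.intersection(*(set(p[i]) for p in periods))
        for i in range(len(prefix))
    )

# The slide's example, with the occurrence subscripts dropped.
print(semi_maximum_periods("ABC", "CAABCDACE"))

# Hypothetical toy database.
db = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(backscan_prunes("B", db), backscan_prunes("A", db))
```

Because the semi-maximum periods depend only on the first instance, the check needs no information beyond what is already visited when the prefix is projected; in the toy database it stops the growth of prefix B while prefix A is still explored.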
Does BIDE perform much better?
• BIDE and CloSpan significantly outperform PrefixSpan and SPADE when the support threshold is low
• BIDE consumes much less memory and can be an order of magnitude faster than CloSpan
• BIDE scales linearly with data size
• The BackScan and ScanSkip techniques are very effective in enhancing performance
Concluding Remarks
• Closed Frequent has the same expressive power as All Frequent, but provides more compact results and likely better efficiency.
• Integrated optimization techniques for database projection, search space pruning, and pattern closure checking are required.
• Move from the candidate-maintenance-and-test paradigm to a new paradigm without candidate maintenance.
Any Questions?
My Questions…
• CloSpan
  – vs.
  – D=D=D
  – D=D=D
• BIDE
  – How can MPi/SMPi be computed or maintained efficiently?
  – Can BIDE be easily adapted to sequences of itemsets?
• What is the difference between Closed Frequent Sequences and Non-trivial Repeating Patterns?