Fast discovery of sequential patterns in large databases using effective time-indexing
Information Sciences (2008) 4228-4245
Ming-Yen Lin, Suh-Yin Lee and Sheng-Shun Wang
National Chiao Tung University, Taiwan
Advisor: Prof. Huang, Jen-Peng
Student: TU, JING-GUO
Outline
Introduction
Related work
Definition
An example
Performance analysis and experimental evaluation
Conclusions
Introduction
The time constraints between elements of a sequential pattern are not specified, so some uninteresting patterns may appear. For example, without specifying the maximum time gap, one may find a pattern < (b, d, e) (a, f) >, which means that an itemset containing a and f occurs after an itemset containing b, d, and e. However, the pattern could be insignificant if the time interval between the two itemsets is too long, such as several months.
[Figure: purchase timeline (pc, then printer, then ink and paper) along a time axis; the time between purchases is marked "?"]
[Figure: the same purchases on a time axis marked 1, 2, 3, 4, 5, …, 100: pc at 1, printer at 2, ink and paper much later]
Related work
Sequential pattern mining: GSP (Apriori), DELISP
Definition
Definition 1 (frequent item): An item x is called a frequent item in a sequence database DB if the support of the 1-sequence < (x) > is greater than or equal to minsup.
Definition 2 (type-1, type-2, prefix, stem):
itemset         Type
< (a) (b) >     Type-1
< (a, b) >      Type-2
[Figure labels on the two patterns above: prefix, stem]
Definition
Definition 3 (it, lst, let)
Transaction   sequence                            TIdx (of item a)
T1            < 1(a) 2(b) 9(d) 15(c) >            [ 1:1:1 ]
T2            < 1(a) 2(b) 9(d) 15(c) 21(a) >      [ 1:1:1 , 21:21:21 ]
[x:y:z] = [ initial-time (it) : last start-time (lst) : last end-time (let) ]
Definition
Definition 3 (it, lst, let), continued
sequence                            itemset    TIdx
< 1(a) 2(b) 9(d) 15(c) >            (a) (b)    [ 1:2:2 ]
< 1(a) 2(b) 9(d) 25(c) 28(a) >      (a) (c)    [ 1:25:25 ]
[x:y:z] = [ initial-time (it) : last start-time (lst) : last end-time (let) ]
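As a rough illustration of Definition 3 for a single item, the sketch below builds the time index of an item from a timestamped sequence. The function name `item_time_index` and the list-of-pairs representation are assumptions made for this example, not notation from the paper. For a single item, every occurrence at time t contributes the triple [t:t:t].

```python
from typing import List, Set, Tuple

# A data sequence is modeled as (time, itemset) pairs in time order.
# This representation is an assumption made for illustration.
Sequence = List[Tuple[int, Set[str]]]


def item_time_index(seq: Sequence, item: str) -> List[Tuple[int, int, int]]:
    """Return one (it, lst, let) triple per occurrence of `item`.

    For a single item, initial-time, last start-time and last end-time
    all equal the time of the occurrence.
    """
    return [(t, t, t) for t, items in seq if item in items]


# Transaction T2 from the slide: < 1(a) 2(b) 9(d) 15(c) 21(a) >
t2 = [(1, {"a"}), (2, {"b"}), (9, {"d"}), (15, {"c"}), (21, {"a"})]
print(item_time_index(t2, "a"))  # [(1, 1, 1), (21, 21, 21)]
```

The result matches the TIdx [ 1:1:1 , 21:21:21 ] listed for T2.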
Definition
Time-constraints
swin = sliding time-window
mingap = minimum time gap
maxgap = maximum time gap
duration = constraint time window
Definition
Lemma 1 (type-1): let_i + mingap ≤ VTP ≤ lst_i + maxgap
VTP = valid time period
Ex: < (b) (e) >
Transaction   sequence                                 TIdx
C2            < 6(a,c) 10(b) 17(e) 18(a) 24(c,d) >     [ 10:17:17 ]
duration = 25: it + duration = 10 + 25 = 35
maxgap = 15: lst + maxgap = 17 + 15 = 32
mingap = 3: let + mingap = 17 + 3 = 20
VTP: 20 ≤ VTP ≤ 32
[Figure: timeline of C2 with b at 10 and e at 17; the VTP spans 20 to 32 and the duration limit falls at 35]
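The Lemma 1 bounds can be checked with a few lines of code. This is a minimal sketch: the function name `type1_vtp` is hypothetical, and capping the upper bound at it + duration is an assumption read off the example figure (which marks 35 as the duration limit), not part of the lemma as stated.

```python
from typing import Optional, Tuple


def type1_vtp(it: int, lst: int, let: int,
              mingap: int, maxgap: int, duration: int) -> Optional[Tuple[int, int]]:
    """Valid time period for a type-1 extension of one time-index entry."""
    lower = let + mingap
    # Lemma 1 gives lst + maxgap; the duration cap is an assumption
    # taken from the example figure.
    upper = min(lst + maxgap, it + duration)
    return (lower, upper) if lower <= upper else None


# < (b) (e) > in C2 has TIdx [10:17:17]
print(type1_vtp(10, 17, 17, mingap=3, maxgap=15, duration=25))  # (20, 32)
```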
Definition
Lemma 2 (type-2): let_i - swin ≤ VTP ≤ min{ lst_i + swin , it_i + duration }
Ex: < (b) (e) >
Transaction   sequence                                 TIdx
C2            < 6(a,c) 10(b) 17(e) 18(a) 24(c,d) >     [ 10:17:17 ]
[Figure: timeline of C2 with b at 10, e at 17, and (c,d) at 24; the axis ends at 35]
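The Lemma 2 bounds can be sketched in the same way. The function name `type2_vtp` is hypothetical; the inputs are the single-item entries [5:5:5] and [6:6:6] of item a with swin = 2 and duration = 25, as used in the example section.

```python
from typing import Optional, Tuple


def type2_vtp(it: int, lst: int, let: int,
              swin: int, duration: int) -> Optional[Tuple[int, int]]:
    """Valid time period for a type-2 extension of one time-index entry."""
    lower = let - swin
    upper = min(lst + swin, it + duration)
    return (lower, upper) if lower <= upper else None


print(type2_vtp(5, 5, 5, swin=2, duration=25))  # (3, 7)
print(type2_vtp(6, 6, 6, swin=2, duration=25))  # (4, 8)
```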
An example
min_Sup = 2
Tran. ID   sequences
C1         < 3(c) 5(a,f) 18(b) 31(a) 45(f) >
C2         < 6(a,c) 10(b) 17(e) 18(a) 24(c,d) >
C3         < 1(b) 20(b,g) 27(e) 34(d,g) 35(g) >
C4         < 5(a) 10(d) 21(c,d) 26(e) >

Item   Support
a      3
b      3
c      3
d      3
e      3
f      1
g      1
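The support table can be reproduced by counting, for each item, the number of sequences that contain it at least once. The sketch below assumes a dictionary of (time, itemset) lists as the database representation; the names `DB` and `item_support` are illustrative only.

```python
from collections import Counter

# The four example sequences as (time, itemset) pairs.
DB = {
    "C1": [(3, {"c"}), (5, {"a", "f"}), (18, {"b"}), (31, {"a"}), (45, {"f"})],
    "C2": [(6, {"a", "c"}), (10, {"b"}), (17, {"e"}), (18, {"a"}), (24, {"c", "d"})],
    "C3": [(1, {"b"}), (20, {"b", "g"}), (27, {"e"}), (34, {"d", "g"}), (35, {"g"})],
    "C4": [(5, {"a"}), (10, {"d"}), (21, {"c", "d"}), (26, {"e"})],
}


def item_support(db):
    """Count each item once per sequence in which it appears."""
    counts = Counter()
    for seq in db.values():
        seen = set()
        for _, items in seq:
            seen |= items
        counts.update(seen)
    return counts


support = item_support(DB)
frequent = sorted(x for x, n in support.items() if n >= 2)  # min_Sup = 2
print(frequent)  # ['a', 'b', 'c', 'd', 'e']
```

Items f and g each appear in only one sequence, so they fall below min_Sup = 2.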
An example (min_Sup = 2)
Time-constraints: swin = 2, mingap = 3, maxgap = 15, duration = 25
Tran. ID   sequences                                  a-TIdx
C1         < 3(c) 5(a,f) 18(b) 31(a) 45(f) >          [ 5:5:5 , 31:31:31 ]
C2         < 6(a,c) 10(b) 17(e) 18(a) 24(c,d) >       [ 6:6:6 , 18:18:18 ]
C3         < 1(b) 20(b,g) 27(e) 34(d,g) 35(g) >       (none)
C4         < 5(a) 10(d) 21(c,d) 26(e) >               [ 5:5:5 ]
An example: item a, Tran. ID C1
sequences: < 3(c) 5(a,f) 18(b) 31(a) 45(f) >
TIdx: [ 5:5:5 , 31:31:31 ]
Time-constraints: swin = 2, mingap = 3, maxgap = 15, duration = 25

For [ 5:5:5 ] (it + duration = 5 + 25 = 30):
1. let_i + mingap ≤ VTP ≤ lst_i + maxgap: 8 ≤ VTP ≤ 20
2. let_i - swin ≤ VTP ≤ min{ lst_i + swin , it_i + duration }: 3 ≤ VTP ≤ 7

For [ 31:31:31 ] (it + duration = 31 + 25 = 56):
1. let_i + mingap ≤ VTP ≤ lst_i + maxgap: 34 ≤ VTP ≤ 46

[Figure: timeline of C1 (c at 3; a, f at 5; b at 18; a at 31; f at 45) with "1" marks added step by step as each VTP is checked]
An example: item a, Tran. ID C2
sequences: < 6(a,c) 10(b) 17(e) 18(a) 24(c,d) >
TIdx: [ 6:6:6 , 18:18:18 ]
Time-constraints: swin = 2, mingap = 3, maxgap = 15, duration = 25

For [ 6:6:6 ]:
1. let_i + mingap ≤ VTP ≤ lst_i + maxgap: 9 ≤ VTP ≤ 21
2. let_i - swin ≤ VTP ≤ min{ lst_i + swin , it_i + duration }: 4 ≤ VTP ≤ 8

For [ 18:18:18 ]:
1. let_i + mingap ≤ VTP ≤ lst_i + maxgap: 21 ≤ VTP ≤ 33

[Figure: timeline of C2 (a, c at 6; b at 10; e at 17; a at 18; c, d at 24) with "1" marks added step by step as each VTP is checked]
An example: item a, Tran. ID C4
sequences: < 5(a) 10(d) 21(c,d) 26(e) >
TIdx: [ 5:5:5 ]
Time-constraints: swin = 2, mingap = 3, maxgap = 15, duration = 25

For [ 5:5:5 ]:
1. let_i + mingap ≤ VTP ≤ lst_i + maxgap: 8 ≤ VTP ≤ 20

[Figure: timeline of C4 (a at 5; d at 10; c, d at 21; e at 26) with a "1" mark]
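The walk-through over C1, C2, and C4 can be condensed into a short sketch: for every occurrence of a, compute the type-1 and type-2 VTP, collect the items whose transaction time falls inside, and count each item once per sequence. This is an illustration under stated assumptions rather than the METISP procedure itself: the names `stems_in_sequence` and `frequent_extensions` are hypothetical, the type-1 upper bound is capped at it + duration (an assumption), and the prefix item is excluded from its own type-2 candidates.

```python
from collections import Counter

# Constraints and sequences copied from the example slides.
SWIN, MINGAP, MAXGAP, DURATION = 2, 3, 15, 25
DB = {
    "C1": [(3, {"c"}), (5, {"a", "f"}), (18, {"b"}), (31, {"a"}), (45, {"f"})],
    "C2": [(6, {"a", "c"}), (10, {"b"}), (17, {"e"}), (18, {"a"}), (24, {"c", "d"})],
    "C3": [(1, {"b"}), (20, {"b", "g"}), (27, {"e"}), (34, {"d", "g"}), (35, {"g"})],
    "C4": [(5, {"a"}), (10, {"d"}), (21, {"c", "d"}), (26, {"e"})],
}


def stems_in_sequence(seq, item):
    """Items found inside the type-1 and type-2 VTPs of `item` in one sequence."""
    type1, type2 = set(), set()
    for t, items in seq:
        if item not in items:
            continue
        # Single-item time index entry is [t:t:t].
        lo1, hi1 = t + MINGAP, min(t + MAXGAP, t + DURATION)  # Lemma 1 (+ duration cap)
        lo2, hi2 = t - SWIN, min(t + SWIN, t + DURATION)      # Lemma 2
        for u, others in seq:
            if lo1 <= u <= hi1:
                type1 |= others
            if lo2 <= u <= hi2:
                type2 |= others - {item}
    return type1, type2


def frequent_extensions(db, item, min_sup):
    """Stems reaching min_sup, counted once per sequence."""
    count1, count2 = Counter(), Counter()
    for seq in db.values():
        s1, s2 = stems_in_sequence(seq, item)
        count1.update(s1)
        count2.update(s2)
    keep = lambda c: sorted(x for x, n in c.items() if n >= min_sup)
    return keep(count1), keep(count2)


type1, type2 = frequent_extensions(DB, "a", min_sup=2)
print(type1)  # ['b', 'd']
print(type2)  # ['c']
```

Under these assumptions the sketch yields the type-1 stems b and d and the type-2 stem c, which correspond to the patterns (a)(b), (a)(d), and (a, c).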
[Figure: the four sequences with the TIdx entries [ 5:5:5 , 31:31:31 ], [ 6:6:6 , 18:18:18 ], [ 5:5:5 ] and the accumulated counter marks 1 2 1 2 1 2]
An example (min_Sup = 2)
Time-constraints: swin = 2, mingap = 3, maxgap = 15, duration = 25
Tran. ID   sequences                                  TIdx
C1         < 3(c) 5(a,f) 18(b) 31(a) 45(f) >          [ 3:3:5 ]
C2         < 6(a,c) 10(b) 17(e) 18(a) 24(c,d) >       [ 6:6:6 ]
C3         < 1(b) 20(b,g) 27(e) 34(d,g) 35(g) >
C4         < 5(a) 10(d) 21(c,d) 26(e) >
Counter shown on the slide: 2
An example (min_Sup = 2)
Time-constraints: swin = 2, mingap = 3, maxgap = 15, duration = 25
Tran. ID   sequences                                  TIdx
C1         < 3(c) 5(a,f) 18(b) 31(a) 45(f) >          [ 3:3:18 ]
C2         < 6(a,c) 10(b) 17(e) 18(a) 24(c,d) >       [ 6:6:10 ]
C3         < 1(b) 20(b,g) 27(e) 34(d,g) 35(g) >
C4         < 5(a) 10(d) 21(c,d) 26(e) >
No more patterns can be formed.
An example
Min_Sup = 2
Frequent itemset a: (a, c), (a)(b), (a)(d), (a, c)(b)
Frequent itemset b: (b)(a), (b)(d), (b)(e), (b)(e)(d)
Frequent itemset c: (c)(b), (c)(e), (c)(b)(a)
Frequent itemset d
Frequent itemset e: (e)(d)
Dealing with extra-large databases
Performance analysis and experimental evaluation
Average number of transactions per data-sequence = 10
Average number of items per transaction = 2.5
Average size of potentially sequential patterns = 4
Average size of potentially frequent itemsets = 1.25
Number of data sequences in database = 100k
Conclusions
This paper has presented METISP, a time-indexing algorithm for mining sequential patterns with various time constraints, including minimum, maximum, and exact gaps, sliding time-windows, and durations. METISP effectively shrinks the search space of potential patterns.