A Clustering Algorithm for Logfile Data Sets
Risto Vaarandi

Department of Computer Engineering
Tallinn Technical University
Raja 15, Tallinn, Estonia
risto.vaarandi@…
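Abstract. Today, vast amounts of system health and status information are stored in logfiles. Therefore, logfile data mining is an important system management task. This paper presents a novel clustering algorithm for logfile data sets, which helps one to detect frequent patterns from logfiles, to build logfile models, and to identify anomalous logfile lines.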
1
S
y s t e m
m
a n a p
p p
O
n
a l m
l i e d
c o m
e
p
d
e n
e p
l o g
g
t h
t h e r y
e n
h ic h
el s ,
e n .
T
B
o g
l i n
r f e r
o
e
t h
. e e
patterns
entify
and
h
from
paper pres ents one to d
i m
p
o n
f
o r t a n
e x
] ,
a
t y p
c o m
a s
a
g
e n
l o g
e x
e r i e n l t
m
s e r i o u
e
c a n s i m
etec t freq
eal th
information are
l ogfil es
a nov el
anomal ous
d i g
t h
p
e s c a l a t e
t
m
u
t s ,
o n
c e o n
e r a l
r u
s y s t e m
m
e n
o p
is
an
important
c l us tering al gorith
uent patterns
m for
from l ogfil es ,
to
l ogfil e l ines .
g
l e ,
a n
t e c h
p
e x
g
f a u
c e l l e n
u
e s
t
.
s o u
h d
a n s .
e
l o g d
w
i t i o n
s
f o r
d
f
i n
o n
s i d
g .
t h
e
-
u
e s
s
a r e
o d
a y ,
s y s t e m t
i s
e v e n
d
o m
r e l e v a n
h
s
i q
T
e r
r e l e v a n c e
n
c t i o n
o t h
e r e d
g
c t i o n
t e c h
i t o r i n o r
a l l
i n
a l f u n g
a l f u n
r e l e v a n
c o n
e t e r m
m
m
e r e
o
a r e
m
e v i c e ,
h
o t i o n
d
i t o r i n
d
f i l e
o r k
f i l e
n
a n o n
l t s
e t w
e
r c e
l t s m
p r o b l e m t h
n
l o g
T
c o n
f a u
s i s
,
s i b l e
f a u
p r o p r i a t e
s y s t e m
i q
e d
l t
o s s i b l e
s e r i o u
n
e n
l o g
a l l
a p
s y s t e m
r e h
a r e
p
e n e
o r e
g
f o r h t h
m
e r a t i n
t
W
o f
t o
c o m
e n
t .
o s t
i t o r i n
a
p
c e
i c a l
l o g
E
C
e
[ 3
l o g
] ,
-
o n
o r e s
p a t t e r n o r d
e r
l o
y e d
.
d t o
F
y
l y
.
I f
t h
e
t h
t h t h a
p
c e
p
p
a t t e r n
a r e
o s e
k
f a u
r e v i o u
c o r r e s p
o n
n
n
l t s
i n
t h
s e n
d
t h
t h
t h
a t
a r e
n
k n
o w
u g
m
e n
t s
e
w
s
p
a n
b
y .
h T
h
a l r e a d n
e s s a g
f a u e
M
S
i n
l t
d , c o m k
n
e
m
e d
e a l t h
i n
b y
t s
a i n t
a n
s t a t u
n
l y
t o
t h o
c c u
s i n
g
c e
g
i n t h
s p
d
s
o
e
f
o b i l e o n
n
t h
t e x
t h
p
p
h
a l
s
m
e
l o g
f i l e
i s
n
m
e
i n
o
p
a s
f
a s t
f o r
h
f
f i l e
o n
e i r
r o a c h
o
t u
o
l o g
t h
a d
a ] ,
e v e r y
e
p a t t e r n
a p
e r e
e
, [ 1
a t a b a s e
m
o w
r s ,
d
r e l y
e
a t i o n a t c h
e c t s
e ,
s y s t e m
t h
w
l i n
s e d e
S
l e - l i n
a t
e
t h
d
u
t o
i t i o n f i l e ,
e
t h
f o r m
. ,
s i n t h
f r o m
r i t i n o n
w
a s
s
i n
e . g
i s t r a t o r s
w m
e a l t h
r a m
a t c h
e s s a g m
o d
l o g
g
p r o g
s )
- m
i s
c o n
t h
l o g
h
f i l e s ,
p a t t e r n
a d
a n
y
l o g
o r
i t h
s y s t e m
e m
t
a t t e r n
S
s y s t e m g
a r e
s c r i p
l i n
e
o f
i t o r i n
a
e
r c e
o n
s e v e r a l
,
t o
s o u m
i s
d a t a b a s e s
o w
s l y d
g
( o r
f t e n
e
e v
t o o l
a r i n
o
t h f o r
a l l y
g
( e . g . ,
o s t
a s
e d o r m
a t t e r n
M p
n
i t o r i n
a c t i o n
e
a t
i n
c o m
a
f i l e s
e v e l o p
o n
b
c e r t a i n
e s
S
m
I f
l o g
d
e t c .
f i l e
f i l e ,
c r e a t e
o f
b e e n
i s t r a t o r ) .
t y p
e t e c t e d n
a i n
d e
e
I n
e
e m
e n
o r t a n
p a v e
m
a n
f l a w
l y
t h
o n
a r e
h
t e s
a d
c e
b e p
f a g
i l l
f i l e s
i m
t h
e c u
e s s a g s
o a n
s e r v i c e ,
p r o d
S
t o
s y s t e m p
,
t o
t h
[ 2
e d
i t o r
e
w
s y s t e m t
c e m
p
e y
l i c a t i o n
t o o l s
e s , d
p
e r e f o r e ,
s e
a d
f a u
e m
e y p
of s y s tem s tatus
h is
to id
i @
.
o f
s u
c o m
a b l e
b u
b e r
e
m
i n
and
d
mining
el ps
s y s t e m
fault message patterns.
t h
h
s u r v e i l l a n o f
o s t
i s
t ,
e c a u
e s s a g
m a p
t h
h
-
b e f o r e
e
t
g a r t
s y s t e m
g
s y s t e m
u m
m
o f
i n
d
e d
e
p
t o
e v
o n
c e r n
t
e a r l y ,
o s t
c o n
i t o r i n
o r t a n
e t e c t e d
L
l ogfil e mod
o n
i m
a r e
n
w
niv ers ity Es tonia
Introduction
i s
d
uil d
ata s ets
T
U
al l inn,
a l l o n
e
i s t r a t o r m
o n
a t c h
i t o r
f o r
i t
In order to solve this problem, the following model-based approach can be employed. First, the system administrator creates the database of fault message patterns as usual. Then the system administrator tries to identify all logfile lines that do not represent fault conditions but rather reflect normal system activity, e.g., messages about successful completion of transactions. Once such lines have been identified (if there are any), the system administrator creates the database of normal message patterns that match those lines. Those two databases constitute the model of the logfile. If a message is logged that does not fit the model, i.e., it does not represent any known fault or normal system activity, the message can be regarded as anomalous and directed to further processing.

However, the task of creating the model for the logfile manually is extremely time-consuming and error-prone. Furthermore, the model-based approach works well only if the model is precise. For example, if the system administrator has created an incomplete database of normal message patterns, many false alarms can be expected. Therefore, for logfiles that are larger and contain a large number of different messages, it is essential to have methods and software tools for creating the model. Unfortunately, currently no such open source tools are available. Thus, this paper focuses on automating the model creation.

One appealing choice for solving this problem is the employment of data clustering algorithms. Clustering algorithms aim at dividing the set of objects into groups (or clusters), where objects in each group are similar to each other (and as dissimilar as possible to objects from other groups). When logfile lines are viewed as objects, clustering algorithms are a natural choice, because line patterns form natural clusters: lines that match a certain pattern are all similar to each other, and generally dissimilar to lines that match other patterns. If such natural clusters could be detected, it would greatly alleviate the problem of logfile model generation.

In this paper, the author proposes a new clustering algorithm for mining patterns from logfiles, and presents an experimental clustering tool called SLCT (Simple Logfile Clustering Tool). It should be noted that not all logfiles need to be analyzed with such a tool: if the logfile is small, or if only a few different messages are logged into the logfile, the model for it can be created by hand with a little effort. The rest of this paper is organized as follows: section 2 discusses related work on data clustering, section 3 presents a new clustering algorithm for logfile data sets, section 4 describes SLCT, and section 5 concludes the paper.

2 Related work on data clustering

Clustering methods have been researched extensively over the past decades, and many algorithms have been developed [4]. The clustering problem is often defined as follows: given a set of points with n attributes in the data space ℜ^n, find a partition of points into clusters so that points within each cluster are close (similar) to each other. In order to determine how close (similar) two points x and y are to each other, a distance function d(x, y) is employed. Many algorithms use a certain variant of the Lp norm (p = 1, 2, ...) for the distance function:
$$d(x, y) = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$$
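As a small illustration of the Lp norm distance, the sketch below evaluates it for two numeric points. The function name `lp_distance` and the sample points are assumptions made for this example only; they do not come from the paper.

```python
def lp_distance(x, y, p=2):
    """Return the L_p distance between two equal-length numeric points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    if p < 1:
        raise ValueError("p must be at least 1")
    # Sum |x_i - y_i|^p over all attributes, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical sample points with n = 3 attributes.
x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)
manhattan = lp_distance(x, y, p=1)  # |1-4| + |2-6| + |3-3|
euclidean = lp_distance(x, y, p=2)  # square root of 3^2 + 4^2 + 0^2
```

With p = 1 this is the Manhattan distance and with p = 2 the Euclidean distance. Both assume numeric attributes, so neither applies directly to points whose attributes are words.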
T
o d
o r i g u
s u
c a
a l l y
a l l y t e g
a
d
e s
') a t
d
i s t a n
l o g
f i l e
( 'C
o n
t h
e
w l i n
n
f o u
n
t h
h n
t o d
e x
i s t
2
0
i n
3
,
h
10
m
h
e
a n
d
i m
e n
b s p
g
:
g
c o
:
1,
1, b y
g
:
e
S A P
a
u
a n
]
s s w
b s p
g
d
e n
o
F m
1
9 2
e ne r a
t i o
u
t h
,
t h
s
8
d
t h
8 m
7
h
p
e
.
i s
i g h
- d
i m
l i e d
f a r
, c l u
i n
) ,
a n
s ,
h
a n
e r
e
,
g
e x
1,
18
i n
t h
e y
2
2
e
a l p
o t h
e r ,
s o m
e
( 13
,
17 49
a
i n
3 , ,
v e r y
d
1,
d
,
e n
9
a k
e s
t h
i s
s t e r i n o
s e
f
9 8
s p
e t s
m
1,
8
t h o i n
r c e s , c l u
a t a
e e n
s p
a c e s
8 0
a l
c a n b
A
t i n
t s
i c h
a l
e
o i n
a t a
o f
h
b s p 3
d
a v e
a i r w
t o t h
p
o i n h
s o u
s u
t s
p
e b e
i n
a t a
s
t h
e d
d a t a .
e r y
i n
o r i g
f o r m
a t a
o t l a r
c a n
f i l e
o d
n u
p l e ,
d
t r a d i t i o n
i s t
p o i n
d
e t h
e v
t o
e l o g
e r e
s i o n
f o r
a m
o f
m
o r e ,
d a t a
t h
h g
e a c h
a t
1,
w
i v i d
t h
i s
[ 5] ) ,
d
',
o r i c a l
o p
e s
e x
f r -
b a c k
p
o r
y
o
c a
i t
t
l i n
i t h
s e t
a t c h
b e F
b
e n
a t
i n
f i l e
e r e
w
c a t e g
s e v e r a l
t a t i o n
a l ,
e r m
c e
'h
e r e h
t e s
e r e f o r e ,
c a n
t e d
- d i m
f r o m
1,
h
t e .
s t e r i n
i g h
t h
T
w
e r e d
f o r
g h
a t
l i n
a r t
s i n d
t h
s i o n
u r t h
o r d
a t t r i b u
o n
o u
a t t r i b u
t h
c e ,
( 50
o d
h
t s
o i n
i v i c ',
m
t e s . T
p
o f u n
'C
( w
c o e f f i c i e n
c l u
h
s t e r s
d
e t h
d a ',
r e s e n
e n a l
F
s t a n
i t h
a t
l o g
( a c c o r d
8 ] .
w
s ,
J a c c a r d
c a s e a p
s e t
r e p r e s e n
t o e
d
t h
a c e s
o t e
n- t h r e p
t h
s i s t
c o m
e a c h
e
b e
t h
e
o d s p
a n
n
t s .
N
c e
s e
7
o i n t h
t h
o i t e
e t h a l
i t e
a t t r i b u
p
t r a d i t i o n
o r
o t h
o f
a s
o s s i b l e
m
e
9 3
c l u
e
2
5,
, 8
)
a c e
g
t h
a r e t h
e y
s t e r
i n
s p a c e .
t y p
. 1
u
[ 4,
e s c r i b e d
a b o
i c a l l y
l i n
l e ,
l d
a r e p
F
a l
e v e r y
. 1 6
] . 3 0
t h
c o u
a t u r a l
2
i s
p
t s
n
n
f o r
a t a
u
m
c o n
f i n
( 'H q
t a s k s i n
e
a p
15)
,
O
s e t ,
o f t e n
i m
i t i o n
o f
a m
t h
h
e ) ,
a n
l i n
e s
e
v
e
a r e
i g h
d
a l s o
- d i m
m
o s t
e n o f
r e l e v a n
s i o n t h
e
a l
l i n
t
t o
t h
( i . e . , e
e
t h
c l u
e r e
p a t t e r n
s
s t e r i n
a r e
u
g
o f
c o r r e s p
s u a l l y o n
d
t o
. 1
p l e t e
t i o e
o i n
,
3 6
c h
a r e
i s
n ≥
e r .
n c o
i n
p
t r a d
( s u
e a s y
i l l
t s
a t a
a d
i s
b e r
i s t
a r e i t
m
a n
a t e l y ,
e y
o s t
u
d
w
a y n
e s e
a t a
e nt i c a
s t e r
6 7
o n e x
f e w
d a t a ,
A
F
i t e m
a c e s ,
d
s
e
t o d
[ 7
n
e e n
s i o n
e n
s e t s
i s
o i n i t
g
i m
n f o
r
f i r s t
j o
h
d i m
n a c c e p
e n
s i o n
t e d .
o f
t h
e
d a t a
s p a c e ,
a n
d
e v
e l o p
e d
a n
d
P
d
c o r r e s p
o n
d
t o
t h
h
i g h
I Q
U
e
log: *. a l
M t
6
r o b l e m
o r d
W
p
l s o ,
v a l u
a t a t e
s t e r i n
- d
o r i c a l
a t a
t
8 . 1 . 1
e t e c t
s i o n
f i l e
6
a l m e n
y
o r
a
c l u
t h
d
a t a
a
f o r t u t h
d
d
a s
2 . 1
e n
o t h e n
p
l o g
a c e s .
r d
p a s t
s i o n
a n u
i m
a l i t y
g
a t u r a l
r i n
d
9
s p a c e
a n
o t
o r i c a l
n
t o
e x n
d
a t t r i b u
A
b e t w
a t a
g
d
c l u
l o w
y
c a t e g
i f f e r e n
d
s e t s
h
a n
c e
f t e n
U
w
2 4, m
d
. 1. 1' ) .
e r e
a l 7
o
s t e r s
e a c h
3 - 4 w
k e y
p a t t e r n
e n
t o
nne c t i o n f r o
R
h
a b l e
s t e r
a v e
c r e a s e s ,
s e v e r e
c e
s t
w
n
'r e d ') .
a t a
c l u
i r d
s i n
j u
s y
a
r ,
1
h
e r
a l
i n
a n
a n
',
t e s . w
o f
l o
8
n i n
s i o n
s i o n
a t a ,
d
e l l
e n
t h
s u
D
s u
d
d
y
i t e
a n
m
. 16
m
u
c o
d i s t a n
i t i o n
q
a i n s i d
t r a d d a t a
d
s e r v i n
2
s
u
c l o s e
i n
l o
[ 7
d
a
a n
f o r m
f r e q
- d i m
t h
l o
i m
b e
s t e r s
l o
d
t o
v e r y
o r e
l i n
a n
o t
f i l e
c l u
s i o n
( 12 ,
T
w
s i o n
e n
c l u
s e c o n
a n
o m
f o r
e r i c a l
's e d
i s
n f r o ' 19
d
c a t e g
a t t r i b u
o f
a s
t h
l o g
m
44) ,
n
e
e r .
s e e n
a r e
',
o f t e n
i g h
a
e s
i r s t l y ,
o r i c a l
o r d
g
u m
c o n
h
c t i o n
w
n
e
e
c a t e g
f u n
o r k
i m
a r e
a l
3
o t
s
t h
f r o m
e n
s t a r t s
o d
c a n
r e
a p
w
d
t s
f o r
F
a n
o i n
t
a l l e n g
p l e ,
e a s u
o f
i m
a m
p
i t e s
e t e c t i o n
e t h o r i g
t e n
o f
d
p u
t h
s ',
'f r o m
q
e r e
t y p e ,
nne c t i o
i s
o t
r o b l e m
m
n
l y ,
a v e
d
e
p
t h
d
) .
o c u
n- t h
o ',
b e r
e r e
t h
C
o f
e c o n
u m
e
e
10
'F
s
t s
t h
e c t i o n
e a s i l y
n
o i n
e x
e l ,
t
r i g h
p
w h
c h
s t e r i n
',
m
c t i o n e
i t h
r e s t S
n t h
a s
s ,
d
o r d
t o
a j o r c l u
w
a n
o
m
e l o
t e s ,
i f f e r e n
f
e d
o r d
b
s
m
( 'F
f u o
A
o w
c e
v i e w w
d
h
o i c e
] .
o
f o r
e l l
r e r ,
d
s
t w
e d
a t t r i b u 6
a n
t h
a r e n
w
l
f a c t u
b v i o u
c h
e s i g
[ 5,
nu
a t a
e r e
d
r i c a
'g r e e n
o
t h
n i s
o
v a l u m
a y ,
i n
a n
The CLIQUE and MAFIA algorithms closely remind the Apriori algorithm [10]: they start with identifying all clusters in 1-dimensional subspaces, and after they have identified clusters C1, ..., Cm in (k-1)-dimensional subspaces, they form cluster candidates for k-dimensional subspaces from C1, ..., Cm, and then check which of those candidates are actual clusters. Those algorithms are effective in discovering clusters in subspaces, because they do not attempt to measure the distance between individual points, which is often meaningless in a high-dimensional data space. Instead, their approach is density based, where a clustering algorithm tries to identify dense regions in the data space, and forms clusters from those regions. Unfortunately, the CLIQUE and MAFIA algorithms suffer from the fact that Apriori-like candidate generation and testing involves exponential complexity and high runtime overhead [11, 12, 13]. The CACTUS algorithm [6] first makes a pass over the data and builds a data summary, then generates cluster candidates during the second pass using the data summary, and finally determines the set of actual clusters. Although CACTUS makes only two passes over the data and is therefore fast, it tends to generate clusters with stretched shapes, which is undesirable if one wants to discover patterns from logfiles. The PROCLUS algorithm [8] uses the K-medoid method for detecting K clusters in subspaces of the original space. However, in the case of logfile data the number of clusters can rarely be predicted accurately, and therefore it is not obvious what the right value for K is.

Though several clustering algorithms exist for high-dimensional data spaces, they are not very suitable for clustering logfile lines, largely because they don't take into account the nature of the logfile data. In the next section, we will first discuss the properties of logfile data, and then we will present a fast clustering algorithm that relies on these properties.

3 Clustering logfile data

3.1 The nature of logfile data

The nature of the data to be clustered plays a key role when choosing the right algorithm for clustering. Most of the clustering algorithms have been designed for generic data sets such as market basket data, where no specific assumptions about the nature of data are made. However, when we inspect the content of typical logfiles at the word level, there are two important properties that distinguish logfile data from a generic data set. Firstly, most of the words occur only a few times in the data set. Table 1 presents the results of an experiment for estimating the occurrence times of words in logfile data.
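A measurement of this kind can be sketched in a few lines. The helper below is only an illustration under stated assumptions: words are taken to be whitespace-separated tokens, their position in the line is ignored, and the name `occurrence_stats` and the sample lines are hypothetical rather than taken from the paper's experiment.

```python
from collections import Counter


def occurrence_stats(lines, limits=(1, 5, 10, 20)):
    """Count the different words and how many of them occur `limit` times or less."""
    counts = Counter(word for line in lines for word in line.split())
    total_different = len(counts)
    # For every limit, count the words whose occurrence count does not exceed it.
    at_most = {
        limit: sum(1 for c in counts.values() if c <= limit)
        for limit in limits
    }
    return total_different, at_most


# Hypothetical sample data, not one of the data sets used in the paper.
sample = [
    "Connection from 192.168.1.1",
    "Connection from 192.168.1.2",
    "Connection closed by 192.168.1.1",
]
total_different, at_most = occurrence_stats(sample, limits=(1, 2))
```

For a real logfile the same function can be fed an open file object, since it only iterates over the lines once.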
Table 1. Occurrence times of words in logfile data
Data set
Data set
T
o
si z e
tal
d
#
o
w
o
r d
# f
i f f er en
o
o
t
f
w
c c u o
s
o
r d
n
#
s
r - r i n
o
o g
f
w
o
c c u
c e 2
r d
s
r - r i n
ti m
#
es o
o
o g
f
w
r
3
1
s e r v e r lo
g
f ile
( L
2
ac h
1
0 8
, 1
g
f ile in u
A
u
1
n
g
8
0
0
h
e
o r d
8
4
8
, 0
9
3 3
. 9
1
%
)
, 3
5
( 7
0
9
, 5
. 4
ti m
1
)
, 4
0
( 8
4
2
, 7
. 6
4 8
%
#
es o
o
o
r
f
w
1
)
, 4
4
( 8
3
4
, 1
. 8
1
0
r d
s
r - r i n
ti m
#
g
o
o
es o
r
f
w
1
)
, 4
7
( 8
2
6
, 2
. 6
2
0
r d
r - r i n
ti m
s g
es o
r
l ess
9 6
%
o
c c u
l ess
5 9
%
o
c c u
l ess
l ess
8 1
%
s g
1
, 4
)
9
( 8
3
7
, 1
. 8
6 0
%
)
1
, 8
8
7
, 7
8 0
1
, 0
2
( 5
8 0
4
3
, 0
. 2
2 9
%
1
)
, 2
5
( 6
0
6
, 6
. 3
9 7
%
1
)
, 3
5
( 7
2
9
, 5
. 0
3 5
%
1
)
, 4
5
( 7
7
6
, 4
. 2
8 9
%
1
)
, 5
6
( 8
8
3
, 1
. 1
6 5
%
1
)
, 6
9
( 8
9
5
, 3
. 8
3 6
%
)
. 9 1
MB , 8
4
, 0
1
6
, 0
0 9
3
8 3
, 9
( 9
4 8
8
, 4
. 3
1 4
%
3
)
, 9
( 9
4
9
8
, 7
. 4
7 3
%
3
)
, 9
( 9
5 8
0
, 4
. 4
3 9
%
3
)
, 9
5
( 9
8
1
, 4
. 4
9 2
%
3
)
, 9
( 9
5
3
8
, 6
. 5
9 8
%
3
)
, 9 ( 9
5 8
6
, 8
. 5
5 0
%
)
e s
l t s
o f e d
w
g
g
,
o f
s p
r i n
W
h
t o g t h
u
r
m
s e s
I n
e
t h
i d
w i s
e ,
g
" C
f
e n
e x
t
s u o f
f o r m
c o n
e
s a m
t y p
l o g
t s ,
w
f i l e
d
h
g
e
a n
f i l e
d
a t a
e n
t l y .
T
u
p
o r t
l o g
a l l w
u r i n
a c c o r d
s
a r e
e
v e r y T
f r e q i l a r
p h
e r i m
e n
e x
p
i s h
i n
i n s i m
g
t h
i s
i s
t o
a
u
a t n
e n
t h
o t
t ,
a n
e n
o m
t
n
d
s i g n o n
e a r l y
e r e
a r e g
f o r m
i f i c a n a s o
y
s i n
f
t h
e
s t r o n
c e
s t r i n
t
b e e n
%
a n
,
a t
h
50
m
s u r p r i s i n
c e r t a i n
a e n
g
b e f o r e
g ,
w
h
e r e
e . g . ,
%
e
a r e s e t .
d
f r e q
t a i n
b s e c t i o n l o g
f
f r o m
c o n
e r e ] .
a t t e d
s t a n
e
i c h
h
r
s
d a t a
[ 14 o
o c c u
o r d
e
w
l y
e r t y
a t
w
t h
a t a ,
e c t i o n
h
f
i n
o n
e r a l l y
n
w
e r t i e s
t h
a r e
t h
d
c e
p r o p s
o
c e
e b o n
t
o n
o
a j o r i t y o n
W r
g
s e t n
e
o r d
s t r i n
e s
e
m s t
o r t a n
e
a t
p r o p
t h
h
e
t h e r e b s e t
e
t o
d
e n
d
a t a
o i n
t
d
s i t y
s p
w
a c e .
n i s
t h
o f
e
a c e
m d
a t a
e
g
p
d
e d
o f
i l l
%
t h
" ,
m
i p
a n
e
y
c o n
r e s e n
t
a
a d
d r e s s ,
t i m
s t a n
c l u
p
e s , t
o r t n
t h
e r e
p a r t s
o f
s t e r i n
g
a l g
u m
w
b e r ) ;
i l l
a l s o
b e
e
f o r m
a t
t h
r e l i e s
t h
o r i t h
m
a t
m
a n
y
s t r i n
g
o n
a t a .
l d
o r i t h
m l i n
e d
u m
s p a c e ,
n w
f o r t o
t o
u h
m e r e
h
i c h
c l u f o r m
c o n a
o n b e r
d
s p
t a i n
d f i l e
o f
c e r t a i n
l o g w
t h
s t e r i n a
g
o u
o r d
l d
b e
s t e r s
o n
l o g i n
w
c l u
r e l i e s
f r o m
c o r r e s p
w
e t e c t
m
e r e d
e
m d
p r o a c h s i d
a e
o r i t h o u
a l g
a s s u t s
i m
w
a p
c o n
t h a x
a l g
i c h h
i s
f r o m e
T
a r e
r e p r e s e n s
a n
h
b a s e d
s t e r s
o r d
t h
e s i g n
a n
s p
c l u
w
d
a t a ,
d a t a
e
p
a s d
a l
e a c h
h
p
d a t a
a j u
o c c u
e e n
e s s a g e
w
t h
e t e c t e d
a r e
i m
e s s a g
t h
a i m
i n
T
d
a t
e a r W
t o
e s s a g
t f ( m
p
The clustering algorithm
v e r
o r i g
d
f o r m
e c i a l
3.2
O
m
e r .
s p
n
b e t w
e
i n
e t h e
a t h
e n
e s
s
t h
a p
o r l d
f o u
s e c o n
i n
o w s
W
e r e
e
a r t s
l i n
s h
o r d
f o r
w
h
l o g
w
4 0
( 4
MB , 7
3 9
r e s u
s
T
s u
9
4 , 8
c o r r e l a t i o n
u
, 8
r d
)
b s e r v
d
0
5 r
o
r - r i n
in 0
w
o
0
w
e s
lin
T
p
. 9
8
0
4
f r a c t i o n o
, 7
4 8
f
f ile
( W 2
8
lin
s e r v e r lo
1
MB
, 1
es o
o
c c u
x )
t h e n t ic
at io
7
# o
e s
e
( L
. 3
5
s g
x )
s e r v e r lo
5
, 6 lin
in u
C
0
7
r d
r - r i n
ti m
l ess Mail-
o
c c u
t h
e
s p
g .
P
d
p
f i l e s
o i n
a t a
p
a t t r i b u
t s
e .
e r
l i n i
T T
m
a k
e
t
i n
a t
d o
e r t i e s
o f
n
o n
o f
o t
c a t e g o r i c a l
d a t a
i n . . . , i
t h k
l o g
b e l o n
a t t r i b u
e
a
b s p
g
f e w
p
a s s e s
a c e s
f i l e
d
t o
a n
o f
t h
a t a ,
a n
y
o
f
e d
t h
e
outliers.
h
e
l y
s u
i t h
h e
1,
t h
w
d
p r e s e n p r o p
s t e r
s e t .
l i n
t e s
t s
c l u
a n
a r e
e c i a l
o i n
e c i a l
a t a
f a s t
a t
e
( 1 ≤
d
s p
t e s a c e
a t a k
o
≤
a t t r i b u f
h
t e s ,
e a c h a s
s e t .
A
n)
o f
w
d
a t a
n d i m
e n
A region S is a subset of the data space, where certain attributes $i_1, \ldots, i_k$ ($1 \le k \le n$) of all points that belong to S have identical values $v_1, \ldots, v_k$: $\forall x \in S,\ x_{i_1} = v_1, \ldots, x_{i_k} = v_k$. We call the set $\{(i_1, v_1), \ldots, (i_k, v_k)\}$ the set of fixed attributes of region S. If k = 1 (i.e., there is just one fixed attribute), the region is called a 1-region. A dense region is a region that contains at least N points, where N is the support threshold value given by the user.

The algorithm consists of three steps, and is somewhat similar to the CACTUS algorithm [6]: it first makes a pass over the data and builds a data summary, and then makes another pass to build cluster candidates, using the summary information collected before. As a final step, clusters are selected from the set of candidates.

During the first step of the algorithm (data summarization), the algorithm identifies all dense 1-regions. Note that this task is equivalent to the mining of frequent words from the data set (the word position in the line is taken into account during the mining). A word is considered frequent if it occurs at least N times in the data set, where N is the user-specified support threshold value.

After dense 1-regions (frequent words) have been identified, the algorithm builds all cluster candidates during one pass. The cluster candidates are kept in the candidate table, which is initially empty. The data set is processed line by line, and when a line is found to belong to one or more dense 1-regions (i.e., one or more frequent words have been discovered on the line), a cluster candidate is formed. If the cluster candidate is not present in the candidate table, it will be inserted into the table with the support value 1, otherwise its support value will be incremented. In both cases, the line is assigned to the cluster candidate. The cluster candidate is formed in the following way: if the line belongs to m dense 1-regions that have fixed attributes $(i_1, v_1), \ldots, (i_m, v_m)$, then the cluster candidate is a region with the set of fixed attributes $\{(i_1, v_1), \ldots, (i_m, v_m)\}$. For example, if the line is Connection from 192.168.1.1, and there exist a dense 1-region with the fixed attribute (1, 'Connection') and another dense 1-region with the fixed attribute (2, 'from'), then a region with the set of fixed attributes {(1, 'Connection'), (2, 'from')} becomes the cluster candidate.

During the final step of the algorithm, the candidate table is inspected, and all regions with support values equal or greater than the support threshold value (i.e., regions that are guaranteed to be dense) are reported by the algorithm as clusters. Because of the definition of cluster candidate, each cluster corresponds to a certain line pattern, e.g., the cluster with the set of fixed attributes {(1, 'Password'), (2, 'authentication'), (3, 'for'), (5, 'accepted')} corresponds to the line pattern "Password authentication for * accepted". Thus, the algorithm can report clusters in a concise way by just printing out line patterns, without reporting individual lines that belong to each cluster. The CLIQUE algorithm reports clusters in a similar manner [7].

The first step of the algorithm reminds very closely the popular Apriori algorithm for mining frequent itemsets [10], since frequent words can be viewed as frequent 1-itemsets. Then, however, our algorithm takes a rather different approach, generating all cluster candidates at once. There are several reasons for that. Firstly, the Apriori algorithm is expensive in terms of runtime [11, 12, 13], since the candidate generation and testing involves exponential complexity. Secondly, since one of the properties of logfile data is that there are many very strong correlations between frequent words, it makes little sense to test a potentially huge number of frequent word combinations that are generated by Apriori, while only a relatively small number of combinations are present in the data set. It is much more reasonable to identify the existing combinations during a single pass over the data, and verify after the pass which of them correspond to clusters.
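The three steps (data summarization, candidate building, cluster selection) can be sketched as follows. This is a minimal illustration under stated assumptions, not the SLCT implementation: words are whitespace-separated tokens, all function names are hypothetical, lines are counted rather than stored with their candidate, and a printed pattern stops at the last fixed word, so trailing variable words are not shown as wildcards.

```python
from collections import Counter


def find_frequent_words(lines, support):
    """Step 1: find dense 1-regions, i.e. frequent (position, word) pairs."""
    counts = Counter()
    for line in lines:
        for pos, word in enumerate(line.split(), start=1):
            counts[(pos, word)] += 1
    return {key for key, n in counts.items() if n >= support}


def build_candidates(lines, frequent):
    """Step 2: one pass that builds the candidate table with support values."""
    candidates = Counter()
    for line in lines:
        fixed = tuple(
            (pos, word)
            for pos, word in enumerate(line.split(), start=1)
            if (pos, word) in frequent
        )
        if fixed:  # the line belongs to at least one dense 1-region
            candidates[fixed] += 1
    return candidates


def to_pattern(fixed):
    """Render a set of fixed attributes as a line pattern with * wildcards."""
    words = dict(fixed)
    last_position = fixed[-1][0]
    return " ".join(words.get(pos, "*") for pos in range(1, last_position + 1))


def cluster(lines, support):
    """Step 3: report candidates whose support reaches the threshold."""
    frequent = find_frequent_words(lines, support)
    candidates = build_candidates(lines, frequent)
    return {to_pattern(f): n for f, n in candidates.items() if n >= support}


# Hypothetical sample lines in the style of the examples in the text.
log_lines = [
    "Password authentication for john accepted",
    "Password authentication for mary accepted",
    "Password authentication for root accepted",
    "Connection from 192.168.1.1",
    "Connection from 192.168.1.2",
    "RSA key generation complete",
]
clusters = cluster(log_lines, support=2)
```

The sketch reads the data twice, once per pass, and builds every candidate directly from the frequent words found on a line instead of generating and testing word combinations level by level.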
158
e x
R
i s t i n
w
h
m
N
o t e s
u
v e r y s t i l l
c o m
b i n
t h
a n
e l y
a n
g
d
u r i n
o n
d i d
d c e
n
o t r a t h
f i t
t o
t h
i n
o r t a n
t
i s s u
a
o
u
f
e ,
m
a i n d
y
a s s
s t r o n
e r
s u m
p
o v
t h
e
e r
t h
e
d
c o r r e l a t i o n o n
a n
t h
o r t
t h
o r y .
H
n
g
a t
t h
p
e m
i n
p
e r a t e d
l a r g
l o w
a n
a n e n
c h
g l e
a t a ,
a n
d
v e r i f y
a f t e r
t h
e
p
a s s
s t e r s .
m g
e r e
s i n
c l u
a r e
m
a
g
t o
a t e s
i v e n
t o p
s
p r e s e n
o f t e n
a s
i m
e c a n
i s
l i k
a t i o n
t h i f
d h
d i
c o r r e s p
a t
a t
s e r
a a r a n
e m
t h t h
e ,
e
V
o f
e a n l a r g
t h
d
g
i c h
i s t o
e x
t
c e ,
e
n
r e s h o
s
e i r
u m o l d
w
s u
t h
b e t w
n
u m
b e r
o f
v a l u
e ) .
e v e r ,
t h
b s e c t i o n
w
e
f r e q T
m
h
w
i s u
f r e q
u
o t
l i k
n
e n
t
w
o r y
i l l
d
t h
c o s t
i s c u
e n
o
s s
s
w
o r d t o
c a n t h
d
e
i s
s
b
m
a l s o
e
i t s e l f
e f
t h
t
e l y
o r d
e r e f o r e ,
e m
e
e e n
b e r
v e r y
( u n
i d
l e s s
a t e s
a l g
o r i t h
a t t e r
i n
a r e m
i s
m
o r e
e t a i l .
3.3 The issue of memory cost
A
l t h
t h
e r e
a l g
o u
h
e
m
m
s i z e s
T
h
e c k
d a n
r e s u l a r i e s
m
r o w
a i n
s ,
m
e m
e
h
e t h
d
a t a
e r
i n
t h
e
a t a
e t o
s h
o w
l d
o c c u
y m
o s t D b
y
a n
i t s
v o
c a b u
t
s t r u
a t
d
e r i m
u
n
s
e r
t h
s e
–
u n
g
e
w
f o r m
r r e n
d a t a
ac h A
u
o r y
v
o
c a b
u
l a r y
t h
e
a l g o r i t h
0
O
0
0
n
w
o r d
i s
a
e
o t h
a r e
I n
o r d
o c a b u
o f
c t i o n
a t e
e r
v e r y
a s t e
s t r u
e s t i m
e r e f o r e
f a s t ,
s t a n
t h
m
c e s
e
r t h
m
e r ,
w
t o
o r d
i t
p
n
l e m
g
d
o
t h
f e
v o
e
i n
t e d
g
m
i l l
w
a s o
1G
B
e m
o r y .
s d
c a b u
c a b u
b e
i n
c o n
- m
e m o v ] ) .
e n
s u
m
t h
l d
n
o t
s
t h
a n
e
d
i f
1.
I f
t e d . a o
i n
s i z e
,
t o
l o t
c a b u
e - t o - f r o n
e
c o u
o r d
e v
s t e p s e e k
s e t
s e t s ,
s
m
l a r y ) ,
t e r
o r y
m
f i r s t
w
c r e m
[ 14 a t a
l a r y
n
t o
A
e
o r i t h
e a c h
c o u
a
r d
o r
v o
c e
t h
a l g
F
( o r
e l y
t h
e n
- s i z e
w
l i k
i s
e
s .
r r e n
t e r
m
t h
o r d
t a b l e
i s
l a t i n
y t e s
w
o c c u
c o u
u
i u m
a n
i n
e a s u r i n i m
a b
e
,
t
- m
h
t h
f i t
i n
f
a s h
e m
o f
o
l a r y
o r y
e
d
a t a
t o
t h
e
i z e
T
o
tal
#
o
x )
1 x
)
0
1
2
0 1
5
8 0
8 4
3
. 3
MB
, 7
, 6
. 9
MB
, 8
, 1
. 9
MB
, 4
, 8
5
7
8
9 9
1
, 1
4 8
lines
, 7
8 0
lines
, 8
8 3
lines
f o
d
i f f er en
r d
t
T
h
s 1 1
, 7
0
, 8 4
e si z e o v o
0
, 0
8
7 1
6
c ab
u
f
th
e
l ar y
, 8
4 0
, 7
8 0
1
5
9 3
8
MB
MB
, 0
0 9
2
1
4
MB
)
t h s
w
c o n
v
logfile (Linu
e server logfile (Linu
in2
t h
c i r c u
s i z e s
S
t h ent ic at ion server logfile
(W
i s
c e r t a i n
w
C
d
e r
o r e .
Data set
Mailserver
a n
d
a r i z a t i o n
i t s
c e
a s
e g
m
s e t ,
a c c u e d
m
e
i t h
m
o f
m l i n
t h w
f o r
l a r y
f u
i n
a t a
t
a r t s u
e a c h t
d
p
a t a
l a r y
e n
o f
e t e r i o r a t e s
d
o c c u
f o r
d r e d
o v u
s i v e e
e
c t u r e
e v e n h
e n
l i t t i n
l a r g
( e a c h
y
i t s
p r e s e n
l a r y , a
e r
t h
c a b u
p
d a t a
p
g
s p
v o
f o r
a s s e s
d
e x
i s
e
p
i n
u r i n
o r d
c a b u
o h
o r y .
e x
t h p
m
t h
i l t
t s
s e t s
e
w
t w t
e m
s e t ,
i n
b u
e f f i c i e n
m
i l t .
t h
v o
i s
s t i g h
o f
t h
b u
s e r t e d
j u m
l o t
i s
p r e s e n
a n
e m
a
e
s i t u a t i o n
o r y
Table 2. In- m
t h
t
e s
i c h
c o s t ,
i n
l a r y
c o u
a k h
e
a r y
w
l t s
t h
m
m
i n
2
i s
e
g
s
b e
r e e
o c a b u
s e t
s
a b l e
i c h
m w
o r y
m
p r e s e n
T
m
s u
e m
o c a b u
t h
h
m
i l l
v
f o r w
c o n
o r d
i s e
l d
w
c h
o r i t h
p r o b l e m
s u
w
o r y .
t a b l e
e
a t a t
i t
t h
e m
v
e n
o r d
I f
a l g
o f
d
u
't ,
w
s
e
o r i t h i s n
r o n
c o u
t h
f r e q
a l g i t t h
m
t e r m
e n
f o r
o u
s t i l l
o r i t h I n
w
g h
i s
w
l a r y
s p w
e r
h
t o h
a n
i t h
d ,
f r e q
a c e . i c h
o u
n
w
t
T
t h
h
s
w
i s
f i n
t
i t
a l l y
e r t i e s
w
not o r d
i s b
t o s
o
s t o r i n
i n , b
i n
g
i m
e
p r o b l e m
need
i r r e l e v a n
p r o p
a t e l y ,
i l l t h
e
e r e f o r e ,
f o r t u n
i t h s
o f
t .
o r d
o r d
e
e n
U
e
w
o n u
w
c o p
i c h w
h i n
p
w
i t .
e
l o g
f i l e
o s e
u
u
e n s e
t o
a t a
p
i n
i s
t h
f r e q
r e d
u
i c t
a t
d
e n
a
t
m
w
u r i n
a j o r i t y
o r d g
t h
s e
o f
t h
t o
m
e m
o r y
e
v o
c a b u
l a r y
t . t h
e
s t o r e d
i n
B
t h
e f o r e
d
v e r y
o s s i b l e
f r e q
e
f t h
f o l l o
e
m
e m d a t a
w
i n
g
o r y , p
a s s
t e c h a n i s
d m
n
i q u t h
e n
a d
e
e
-
w
e
f i r s t
c r e a t e f o r
b u
i l d
t h i n
e g
A Clustering Algorithm for Logfile Data Sets
the v oc ab ulary , summary from 0
the algorithm mak
v ec tor. T
to m-1
)
w
he w
ord
es an ex
summary
ith eac h c ounter initializ ed
fast string hashing func tion is ap v alues from 0
to m-1
,
and
Before the word vocabulary is built, the algorithm makes an extra pass over the data and builds a word summary vector. The word summary vector is made up of m counters (numbered from 0 to m-1), with each counter initialized to zero. During the pass over the data, a fast string hashing function is applied to each word. The function returns integer values from 0 to m-1, and each time the value i is calculated for a word, the i-th counter in the vector will be incremented. Since efficient string hashing functions are uniform [15], i.e., the probability of an arbitrary string hashing to a given value i is 1/m, each counter in the vector will correspond roughly to W/m words, where W is the number of different words in the data set. If words w1, ..., wk are all words that hash to the value i, and the words w1, ..., wk occur t1, ..., tk times, respectively, then the value of the i-th counter in the vector equals the sum t1+...+tk.

After the summary vector has been constructed, the algorithm starts building the word vocabulary, but only those words will be inserted into the vocabulary for which their counter values are equal to or greater than the support threshold value given by the user. Words that do not fulfill this criterion can't be frequent, because their occurrence times are guaranteed to be below the support threshold.

Given that a majority of the words are very infrequent, this simple technique is quite powerful. If the vector is large enough, a majority of the counters in the vector will have very infrequent words associated with them, and therefore most of the counter values will never cross the support threshold (unless a very low threshold value has been specified).

Table 3 presents an experiment for measuring the effectiveness of the word summary vector technique for three data sets (each counter in the vector consumed 4 bytes of memory). The experiments suggest that the employment of the summary vector dramatically reduces vocabulary sizes, and large amounts of memory will be saved (during the experiments, vocabulary sizes decreased 25-100 times). On the other hand, the memory requirements for storing the vector itself are relatively small. For example, the largest vector we used during the experiments occupied less than 400 KB of memory.
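The two passes described above can be sketched as follows. This is a minimal illustration, not SLCT's implementation: it assumes words are whitespace-separated tokens, uses Python's built-in `hash` in place of a fast string hashing function, and the function names (`build_summary_vector`, `build_vocabulary`, `frequent_words`) are hypothetical.

```python
from collections import Counter

def build_summary_vector(lines, m, hash_fn=hash):
    """Extra pass: counter i accumulates the occurrences of all words hashing to i."""
    vector = [0] * m
    for line in lines:
        for word in line.split():
            vector[hash_fn(word) % m] += 1
    return vector

def build_vocabulary(lines, vector, threshold, hash_fn=hash):
    """Vocabulary pass: skip every word whose counter is below the threshold."""
    m = len(vector)
    vocabulary = Counter()
    for line in lines:
        for word in line.split():
            # A counter is an upper bound on the occurrence count of each
            # word hashing to it, so words below the threshold cannot be frequent.
            if vector[hash_fn(word) % m] >= threshold:
                vocabulary[word] += 1
    return vocabulary

def frequent_words(lines, m, threshold):
    """Run both passes and keep the words that actually reach the threshold."""
    vector = build_summary_vector(lines, m)
    vocabulary = build_vocabulary(lines, vector, threshold)
    return {w: n for w, n in vocabulary.items() if n >= threshold}
```

Because of hash collisions an infrequent word can still enter the vocabulary when it shares a counter with a frequent one, which is why the final filter on exact counts is kept; the vector only reduces how many words have to be stored.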
Table 3. The effectiveness of the summary vector technique

Data set                                 Support threshold  Vector size  Total # of different words  # of words in the vocabulary  Reduction factor
Mailserver logfile (Linux)               1% (76,571)        5,000        1,700,840                   40,935                        41.54
Cache server logfile (Linux)             1% (81,897)        5,000        1,887,780                   18,998                        99.36
Authentication server logfile (Win2000)  1% (48,918)        5,000        4,016,009                   118,208                       33.97
Mailserver logfile (Linux)               0.1% (7,657)       20,000       1,700,840                   61,244                        27.77
Cache server logfile (Linux)             0.1% (8,189)       20,000       1,887,780                   50,246                        37.57
Authentication server logfile (Win2000)  0.1% (4,891)       20,000       4,016,009                   73,376                        54.73
Mailserver logfile (Linux)               0.01% (765)        100,000      1,700,840                   66,849                        25.44
Cache server logfile (Linux)             0.01% (818)        100,000      1,887,780                   69,210                        27.27
Authentication server logfile (Win2000)  0.01% (489)        100,000      4,016,009                   128,922                       31.15
If the user has specified a rather low support threshold value, there could be a large number of cluster candidates with very low support values, and the candidate table could consume a significant amount of memory. To avoid this, the summary vector technique can also be applied to cluster candidates: the algorithm makes an extra pass over the data and builds a summary vector for cluster candidates, which is later used to reduce the number of candidates inserted into the candidate table.
4 Simple Logfile Clustering Tool
In order to implement the logfile clustering algorithm described in the previous section, an experimental tool called SLCT (Simple Logfile Clustering Tool) has been developed. SLCT has been written in C and has been primarily used on Redhat 8.0 Linux, but it should compile and work on most modern UNIX platforms.

SLCT uses move-to-front hash tables for implementing the vocabulary and the candidate table. Experiments with large vocabularies have demonstrated that the move-to-front hash table is an efficient data structure with very low data access times, even when the hash table is full and many words are connected to each hash table slot [14]. Since the speed of the hashing function is of critical importance for the efficiency of the hash table, SLCT employs the fast and efficient Shift-Add-Xor string hashing algorithm [15]. This algorithm is used not only for hash table operations, but also for building summary vectors.

SLCT is given a list of logfiles and a support threshold as input, and after it has detected a clustering on the input data, it reports clusters in a concise way by printing out line patterns that correspond to clusters, e.g.,
Dec 18 * myhost.mydomain * connect from
Support: 570

Dec 18 * myhost.mydomain * log: Connection from * port
Support: 570

Dec 18 * myhost.mydomain * log:
Support: 679
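To make the data structures named in this section more concrete, the following sketch pairs a shift-add-xor style string hash with a chained hash table that moves each accessed entry to the front of its chain. It is an illustration under assumptions rather than SLCT's C code: the shift amounts (5 and 2), the zero seed, the 32-bit mask, the choice to insert new keys at the head of the chain, and the names `shift_add_xor` and `MoveToFrontTable` are all hypothetical.

```python
def shift_add_xor(word, seed=0, left=5, right=2):
    """Shift-add-xor style hash; shift amounts and the 32-bit mask are assumptions."""
    h = seed
    for byte in word.encode("utf-8"):
        h ^= ((h << left) + (h >> right) + byte) & 0xFFFFFFFF
    return h

class MoveToFrontTable:
    """Chained hash table of [key, count] entries with move-to-front on access."""

    def __init__(self, slots):
        self.slots = [[] for _ in range(slots)]

    def _chain(self, key):
        return self.slots[shift_add_xor(key) % len(self.slots)]

    def increment(self, key):
        """Add one occurrence of key and return its new count."""
        chain = self._chain(key)
        for i, entry in enumerate(chain):
            if entry[0] == key:
                entry[1] += 1
                chain.insert(0, chain.pop(i))  # move the accessed entry to the front
                return entry[1]
        chain.insert(0, [key, 1])  # assumption: new keys start at the head
        return 1

    def get(self, key):
        """Return the count stored for key, or 0 if it is absent."""
        for entry in self._chain(key):
            if entry[0] == key:
                return entry[1]
        return 0
```

Moving accessed entries forward keeps frequently seen words near the head of each chain, which fits the skewed word frequencies in logfiles: most lookups hit a frequent word and finish after a few comparisons even when chains are long.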
The user can specify a command line flag that forces SLCT to inspect each cluster candidate more closely, before it starts the search for clusters in the candidate table. For each candidate C, SLCT checks whether there are other candidates in the table that represent more specific line patterns. In the above example, the second pattern is more specific than the third, since all lines that match the second pattern also match the third. If candidates C1, ..., Ck representing more specific patterns are found for the candidate C, the support values of the candidates C1, ..., Ck are added to the support value of C, and all lines that belong to candidates C1, ..., Ck are also considered to belong to the candidate C. In that way, a line can belong to more than one cluster simultaneously, and more general line patterns are always reported, even when their original support values were below the threshold. Although traditional clustering algorithms require that every point must be part of one cluster only, there are several algorithms like CLIQUE which do not strictly follow this requirement, in order to achieve clustering results that are more comprehensible to the end user [7].
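The support aggregation described above can be sketched as follows. The sketch assumes that a pattern is a whitespace-separated string in which `*` marks a wildcard position, and that one pattern is more specific than another when it fixes a strict superset of the (position, word) pairs; `parse_pattern` and `join_supports` are hypothetical names, not SLCT functions.

```python
def parse_pattern(pattern):
    """Turn 'Dec 18 * host' into a set of (position, word) pairs; '*' is a wildcard."""
    return frozenset((i, w) for i, w in enumerate(pattern.split()) if w != "*")

def join_supports(candidates):
    """candidates maps pattern -> support; returns a copy with aggregated supports."""
    parsed = {p: parse_pattern(p) for p in candidates}
    joined = dict(candidates)
    for general, general_words in parsed.items():
        for specific, specific_words in parsed.items():
            # Proper subset: `specific` fixes strictly more words than `general`,
            # so every line matching `specific` also matches `general`.
            if general_words < specific_words:
                joined[general] += candidates[specific]
    return joined
```

Applied to the three patterns printed earlier, only the third pattern gains support: the second pattern fixes all of its words plus more, while the first differs at the position of `log:`.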
By default, SLCT does not report the points that do not belong to any of the detected clusters. Though outlier points are considered to form a special cluster, there is no concise description for this cluster, since it does not correspond to any pattern. As SLCT processes the data set, each detected outlier point could possibly be stored to memory, but this could be far too expensive in terms of memory cost, especially when a relatively high support threshold value has been specified, which tends to create few clusters and many outliers. Therefore, SLCT does not discover outliers by default. If the end user has specified a certain command line flag, SLCT makes another pass over the data after clusters have been detected, and writes all outlier points to a file. Note that when there are many outlier points, one can apply SLCT to the file of outliers (and repeat this process iteratively for every new outlier file).
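The extra pass for outliers can be sketched in a few lines. The sketch assumes the same `*` wildcard convention as the printed patterns and treats a line as matching when every fixed word appears at its position; `matches` and `write_outliers` are hypothetical helpers, and `out` is any writable text stream.

```python
def matches(pattern, line):
    """True if every non-wildcard word of the pattern occurs at the same position."""
    words = line.split()
    fixed = [(i, w) for i, w in enumerate(pattern.split()) if w != "*"]
    return all(i < len(words) and words[i] == w for i, w in fixed)

def write_outliers(lines, patterns, out):
    """Write every line that matches no detected pattern; return how many were written."""
    count = 0
    for line in lines:
        if not any(matches(p, line) for p in patterns):
            out.write(line.rstrip("\n") + "\n")
            count += 1
    return count
```

Streaming the outliers to a file keeps memory use independent of their number, and the resulting file can be fed back as input for another clustering run.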
We have made many experiments with SLCT, and it has proved to be a useful tool for building logfile models and detecting interesting patterns from logfiles. Table 4 presents the results of some of our experiments for measuring the runtime and memory consumption of SLCT. The experiments were conducted on a 1.5 GHz Pentium4 workstation with 256 MB of memory and Redhat 8.0 Linux as the operating system. For all data clustering tasks, a word summary vector of size 5,000 was used. Since SLCT was also instructed to identify outlier points, four passes over the data were made altogether.
Table 4. Runtime and memory consumption of SLCT
Columns: Data set; Support threshold; # of detected clusters; # of outlier points; Memory consumption; Runtime.
Rows: Mailserver logfile (Linux), Cache server logfile (Linux), and Authentication server logfile (Win2000), each at support thresholds of 10%, 5%, 1%, and 0.5%.

The results show that our algorithm has modest memory requirements, and finds many clusters from large logfiles in a relatively short amount of time. We also made some tests with the CLIQUE algorithm, in order to measure the difference between the two algorithms in terms of runtime. Even for medium support threshold values, our algorithm was 5-10 times faster. As the support threshold value was lowered, the difference increased even further.
5 Future work and availability information
As future work, we plan to investigate various association rule algorithms, in order to create an algorithm for detecting patterns that span multiple logfile lines and fit in a certain time window. SLCT is distributed under the terms of GNU GPL, and is available at http://kodu.neti.ee/~risto/slct/.
References
1. Stephen E. Hansen and E. Todd Atkins. Automated System Monitoring and Notification With Swatch. Proceedings of the USENIX 7th System Administration Conference, pp. 145-152, 1993.
2. Wolfgang Ley and Uwe Ellerman. logsurfer(1) and logsurfer.conf(4) manual pages. http://www.cert.dfn.de/eng/logsurf/
3. Risto Vaarandi. Platform Independent Tool for Local Event Correlation. Acta Cybernetica 15(4), pp. 705-723, 2002.
4. Pavel Berkhin. Survey of Clustering Data Mining Techniques. http://citeseer.nj.nec.com/berkhin02survey.html
5. Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. Information Systems 25(5), pp. 345-366, 2000.
6. Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS - Clustering Categorical Data Using Summaries. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 73-83, 1999.
7. Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94-105, 1998.
8. Charu C. Aggarwal, Cecilia Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo Park. Fast Algorithms for Projected Clustering. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 61-72, 1999.
9. Sanjay Goil, Harsha Nagesh, and Alok Choudhary. MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Technical Report No. CPDC-TR-9906-010, Northwestern University, 1999.
10. Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, 1994.
11. Roberto J. Bayardo Jr. Efficiently Mining Long Patterns from Databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 85-93, 1998.
12. Roberto J. Bayardo Jr., Rakesh Agrawal, and Dimitrios Gunopulos. Constraint-Based Rule Mining in Large, Dense Databases. Proceedings of the 15th International Conference on Data Engineering, pp. 188-197, 1999.
13. Jiawei Han, Jian Pei, and Yiwen Yin. Mining Frequent Patterns without Candidate Generation. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1-12, 2000.
14. Justin Zobel, Steffen Heinz, and Hugh E. Williams. In-memory Hash Tables for Accumulating Text Vocabularies. Information Processing Letters 80(6), pp. 271-277, 2001.
15. M. V. Ramakrishna and Justin Zobel. Performance in Practice of String Hashing Functions. Proceedings of the 5th International Conference on Database Systems for Advanced Applications, pp. 215-224, 1997.