A Clustering Algorithm for Logfile Data Sets

Risto Vaarandi

Department of Computer Engineering
Tallinn Technical University
Raja 15, Tallinn, Estonia

r i s t o . v a a r a n

Abstract. T s tored

od

ay ,

vast amounts

in logfiles.

T

h erefore,

s y s tem management tas k . l ogfil e d b

1

S

y s t e m

m

a n a p

p p

O

n

a l m

l i e d

c o m

e

p

d

e n

e p

l o g

g

t h

t h e r y

e n

h ic h

el s ,

e n .

T

B

o g

l i n

r f e r

o

e

t h

. e e

patterns

entify

and

h

from

paper pres ents one to d

i m

p

o n

f

o r t a n

e x

] ,

a

t y p

c o m

a s

a

g

e n

l o g

e x

e r i e n l t

m

s e r i o u

e

c a n s i m

etec t freq

eal th

information are

logfiles

a novel

anomalous

d i g

t h

p

e s c a l a t e

t

m

u

t s ,

o n

c e o n

e r a l

r u

s y s t e m

m

e n

o p

is

an

important

clustering algorith

uent patterns

m for

from logfiles,

to

logfile lines.

g

l e ,

a n

t e c h

p

e x

g

f a u

c e l l e n

u

e s

t

.

s o u

h d

a n s .

e

l o g d

w

i t i o n

s

f o r

d

f

i n

o n

s i d

g .

t h

e

-

u

e s

s

a r e

o d

a y ,

s y s t e m t

i s

e v e n

d

o m

r e l e v a n

h

s

i q

T

e r

r e l e v a n c e

n

c t i o n

o t h

e r e d

g

c t i o n

t e c h

i t o r i n o r

a l l

i n

a l f u n g

a l f u n

r e l e v a n

c o n

e t e r m

m

m

e r e

o

a r e

m

e v i c e ,

h

o t i o n

d

i t o r i n

d

f i l e

o r k

f i l e

n

a n o n

l t s

e t w

e

r c e

l t s m

p r o b l e m t h

n

l o g

T

c o n

f a u

s i s

,

s i b l e

f a u

p r o p r i a t e

s y s t e m

i q

e d

l t

o s s i b l e

s e r i o u

n

e n

l o g

a l l

a p

s y s t e m

r e h

a r e

p

e n e

o r e

g

f o r h t h

m

e r a t i n

t

W

o f

t o

c o m

e n

t .

o s t

i t o r i n

a

p

c e

i c a l

l o g

E

C

e

[ 3

l o g

] ,

-

o n

o r e s

p a t t e r n o r d

e r

l o

y e d

.

d t o

F

y

l y

.

I f

t h

e

t h

t h t h a

p

c e

p

p

a t t e r n

a r e

o s e

k

f a u

r e v i o u

c o r r e s p

o n

n

n

l t s

i n

t h

s e n

d

t h

t h

t h

a t

a r e

n

k n

o w

u g

m

e n

t s

e

w

s

p

a n

b

y .

h T

h

a l r e a d n

e s s a g

f a u e

M

S

i n

l t

d , c o m k

n

e

m

e d

e a l t h

i n

b y

t s

a i n t

a n

s t a t u

n

l y

t o

t h o

c c u

s i n

g

c e

g

i n t h

s p

d

s

o

e

f

o b i l e o n

n

t h

t e x

t h

p

p

h

a l

s

m

e

l o g

f i l e

i s

n

m

e

i n

o

p

a s

f

a s t

f o r

h

f

f i l e

o n

e i r

r o a c h

o

t u

o

l o g

t h

a d

a ] ,

e v e r y

e

p a t t e r n

a p

e r e

e

, [ 1

a t a b a s e

m

o w

r s ,

d

r e l y

e

a t i o n a t c h

e c t s

e ,

s y s t e m

t h

w

l i n

s e d e

S

l e - l i n

a t

e

t h

d

u

t o

i t i o n f i l e ,

e

t h

f o r m

. ,

s i n t h

f r o m

r i t i n o n

w

a s

s

i n

e . g

i s t r a t o r s

w m

e a l t h

r a m

a t c h

e s s a g m

o d

l o g

g

p r o g

s )

- m

i s

c o n

t h

l o g

h

f i l e s ,

p a t t e r n

a d

a n

y

l o g

o r

i t h

s y s t e m

e m

t

a t t e r n

S

s y s t e m g

a r e

s c r i p

l i n

e

o f

i t o r i n

a

e

r c e

o n

s e v e r a l

,

t o

s o u m

i s

d a t a b a s e s

o w

s l y d

g

( o r

f t e n

e

e v

t o o l

a r i n

o

t h f o r

a l l y

g

( e . g . ,

o s t

a s

e d o r m

a t t e r n

M p

n

i t o r i n

a c t i o n

e

a t

i n

c o m

a

f i l e s

e v e l o p

o n

b

c e r t a i n

e s

S

m

I f

l o g

d

e t c .

f i l e

f i l e ,

c r e a t e

o f

b e e n

i s t r a t o r ) .

t y p

e t e c t e d n

a i n

d e

e

I n

e

e m

e n

o r t a n

p a v e

m

a n

f l a w

l y

t h

o n

a r e

h

t e s

a d

c e

b e p

f a g

i l l

f i l e s

i m

t h

e c u

e s s a g s

o a n

s e r v i c e ,

p r o d

S

t o

s y s t e m p

,

t o

t h

[ 2

e d

i t o r

e

w

s y s t e m t

c e m

p

e y

l i c a t i o n

t o o l s

e s , d

p

e r e f o r e ,

s e

a d

f a u

e m

e y p

of system status

h is

to id

i @

.

o f

s u

c o m

a b l e

b u

b e r

e

m

i n

and

d

mining

el ps

s y s t e m

fault message patterns.

t h

h

s u r v e i l l a n o f

o s t

i s

t ,

e c a u

e s s a g

m a p

t h

h

-

b e f o r e

e

t

g a r t

s y s t e m

g

s y s t e m

u m

m

o f

i n

d

e d

e

p

t o

e v

o n

c e r n

t

e a r l y ,

o s t

c o n

i t o r i n

o r t a n

e t e c t e d

L

l ogfil e mod

o n

i m

a r e

n

w

niv ers ity Es tonia

Introduction

i s

d

uil d

ata s ets

T

U

al l inn,

a l l o n

e

i s t r a t o r m

o n

a t c h

i t o r

f o r

i t

a t a b a s e . s o l v e

i r s t ,

t h

e

i s

p r o b l e m

s y s t e m

t h

a d

m

, i n

t h

e

f o l l o

i s t r a t o r

w

i n

g

c r e a t e s

m t h

o d e

e l - b a s e d d

a t a b a s e

a p o

p f

r o a c h f a u

l t

c a n m

e s s a g

b e e

p d

atterns as usual. o

not

rep

T

resent

hen the sy stem ad fault

c ond

itions

messages ab out suc c essful c omp id

entified

me ssag mod

( if there are any ) ,

e

p

el of the logfile.

anomalou ow

k

ev

er,

rod

F

the task

time-c onsuming,

p

that d

o p

not

and

p roac h w

p rec ise if

the

or

ork

is

T

U

c urrently

s w

F

attern d

is

hus,

amp

( or c luster)

many

and y

if the d false

c ontains

hand

I t should

greatly

be noted

file is small,

relatively

for it can be created

a few

little w

I n this p from

ap

er,

logfiles,

iv id

isc usses

and

s) .

W

related

w

ap

2

ool) . ork

p

on

d ata

ap

c lustering, d

s and

of

ious,

tools for one in

ata c lustering

T

s,

w

here

as d issimilar as ed

as ob j ec ts,

atterns form natural clusters and

to be analyzed

erimental

section 4

ted

generally b e d

w

issimilar w

ith a

el generation.

ith suc h a tool.

into the logfile,

herefore,

d

etec ted

this p ap

I f the

the mod

el

er focuses on the

ifferent messages.

clustering algorithm for mining p

he rest of this p

ata sets,

b e

c lustering

tool

c alled

er is organiz ed sec tion

3

esc rib es SLCT

as follow

p resents ,

and

SLCT

a

s:

new

sec tion 5

atterns

( Simp

le

sec tion 2 c lustering

c onc lud

es the

Related work on data clustering s hav e

b

een researc hed

many

algorithms hav e b een d

follow

s:

giv en a set of p

er to d

etermine,

istanc e func tion d

norm ( p

=

1

,

2 ,

. . . )

( x ,

ev elop

oints w

oints into c lusters so that p

I n ord d

T

oses a new ex

c ould

er.

Clustering method

p

as

v ariety

has b een d

loy ment of d

If such natural clusters could

ith a little effort.

an

e

tremely

hen logfile lines are v iew b ec ause line p

c ontain a large numb er of d

p resents

algorithm for logfile d p

and

w

id

ing the set of ob j ec ts into group

ifferent messages are logged

the author p rop

Logfile Clustering T d

d

w

ork

alleviate the problem of logfile mod

manually

logfiles that are larger,

ed

en source tools are available.

that not all logfiles need

or if only

oes not

ministrator

alarms

a

c an b e ex

lines that match a certain pattern are all similar to each other,

ould

it d

atabase of normal

are similar to each other (and

to lines that match other patterns. it w

i.e.,

it is essential to have method

no such op

ossible to objects from other group

are tool,

el,

if the sy stem ad

le,

lete,

larger

nfortunately ,

clustering algorithms are a natural choice,

softw

atabase of normal

the message can be regard

ell only

el for it b

Clustering algorithms aim at d

objects in each group

e. g. ,

atab ases c onstitute the

oes not fit the mod

or ex

inc omp

logfile

error-p rone. el c reation.

and

ac tiv ity ,

nc e suc h lines hav e b een

ealing c hoic e for solv ing this p rob lem is the emp

algorithms.

p

ap

of c reating the mod

artic ular area, ne ap

hose tw

3

all logfile lines that

sy stem

ministrator c reates the d T

el for the logfile.

urthermore,

automating the mod

O

el-b ased

is

O

5

to further p roc essing.

mod

atterns

.

messages,

this p

the sy stem ad

entify

normal

letion of transac tions.

I f a message is logged

d irec ted

a good

p

uc ed

reflec t

n fault or normal sy stem ac tiv ity ,

the mod

has c reated

p

now

s and

message

rather

at t e rns that matc h those lines.

rep resent any

H

ministrator tries to id

b ut

1

how y )

for the d

ed

[ 4

] .

ex T

tensiv ely

ov

er

the p

n

ith n attributes in the data space ℜ,

oints w

find

ithin each cluster are close (similar)

close (similar)

is emp

ast d

loy ed

.

M

any

istanc e func tion:

ec ad

he c lustering p rob lem is often d

tw

o p

oints x

and

y

a p

es,

and

efined

as

artition of

to eac h other.

are to eac h other,

a

algorithms use a certain variant of Lp

$$d(x, y) = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}.$$
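The Lp-norm distance function can be illustrated with a minimal Python sketch. The function name `lp_distance` and its argument checks are assumptions made for this illustration and are not taken from any tool described in the paper; points are assumed to be equal-length sequences of numbers.

```python
def lp_distance(x, y, p=2):
    """Return the L_p distance between two numeric points x and y.

    Hypothetical helper: x and y are sequences with one number per
    attribute, and p is the order of the norm (p = 1, 2, ...).
    """
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    if p < 1:
        raise ValueError("p must be at least 1")
    # Sum |x_i - y_i|^p over all attributes, then take the p-th root.
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)


if __name__ == "__main__":
    print(lp_distance([0, 0], [3, 4], p=2))  # Euclidean distance
    print(lp_distance([0, 0], [3, 4], p=1))  # Manhattan distance
```

With p = 2 the function gives the familiar Euclidean distance and with p = 1 the Manhattan distance; both presuppose numeric attributes.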

T

o d

o r i g u

s u

c a

a l l y

a l l y t e g

a

d

e s

') a t

d

i s t a n

l o g

f i l e

( 'C

o n

t h

e

w l i n

n

f o u

n

t h

h n

t o d

e x

i s t

2

0

i n

3

,

h

10

m

h

e

a n

d

i m

e n

b s p

g

:

g

c o

:

1,

1, b y

g

:

e

S A P

a

u

a n

]

s s w

b s p

g

d

e n

o

F m

1

9 2

e ne r a

t i o

u

t h

,

t h

s

8

d

t h

8 m

7

h

p

e

.

i s

i g h

- d

i m

l i e d

f a r

, c l u

i n

) ,

a n

s ,

h

a n

e r

e

,

g

e x

1,

18

i n

t h

e y

2

2

e

a l p

o t h

e r ,

s o m

e

( 13

,

17 49

a

i n

3 , ,

v e r y

d

1,

d

,

e n

9

a k

e s

t h

i s

s t e r i n o

s e

f

9 8

s p

e t s

m

1,

8

t h o i n

r c e s , c l u

a t a

e e n

s p

a c e s

8 0

a l

c a n b

A

t i n

t s

i c h

a l

e

o i n

a t a

o f

h

b s p 3

d

a v e

a i r w

t o t h

p

o i n h

s o u

s u

t s

p

e b e

i n

a t a

s

t h

e d

d a t a .

e r y

i n

o r i g

f o r m

a t a

o t l a r

c a n

f i l e

o d

n u

p l e ,

d

t r a d i t i o n

i s t

p o i n

d

e t h

e v

t o

e l o g

e r e

s i o n

f o r

a m

o f

m

o r e ,

d a t a

t h

h g

e a c h

a t

1,

w

i v i d

t h

i s

[ 5] ) ,

d

',

o r i c a l

o p

e s

e x

f r -

b a c k

p

o r

y

o

c a

i t

t

l i n

i t h

s e t

a t c h

b e F

b

e n

a t

i n

f i l e

e r e

w

c a t e g

s e v e r a l

t a t i o n

a l ,

e r m

c e

'h

e r e h

t e s

e r e f o r e ,

c a n

t e d

- d i m

f r o m

1,

h

t e .

s t e r i n

i g h

t h

T

w

e r e d

f o r

g h

a t

l i n

a r t

s i n d

t h

s i o n

u r t h

o r d

a t t r i b u

o n

o u

a t t r i b u

t h

c e ,

( 50

o d

h

t s

o i n

i v i c ',

m

t e s . T

p

o f u n

'C

( w

c o e f f i c i e n

c l u

h

s t e r s

d

e t h

d a ',

r e s e n

e n a l

F

s t a n

i t h

a t

l o g

( a c c o r d

8 ] .

w

s ,

J a c c a r d

c a s e a p

s e t

r e p r e s e n

t o e

d

t h

a c e s

o t e

n- t h r e p

t h

s i s t

c o m

e a c h

e

b e

t h

e

o d s p

a n

n

t s .

N

c e

s e

7

o i n t h

t h

o i t e

e t h a l

i t e

a t t r i b u

p

t r a d i t i o n

o r

o t h

o f

a s

o s s i b l e

m

e

9 3

c l u

e

2

5,

, 8

)

a c e

g

t h

a r e t h

e y

s t e r

i n

s p a c e .

t y p

. 1

u

[ 4,

e s c r i b e d

a b o

i c a l l y

l i n

l e ,

l d

a r e p

F

a l

e v e r y

. 1 6

] . 3 0

t h

c o u

a t u r a l

2

i s

p

t s

n

n

f o r

a t a

u

m

c o n

f i n

( 'H q

t a s k s i n

e

a p

15)

,

O

s e t ,

o f t e n

i m

i t i o n

o f

a m

t h

h

e ) ,

a n

l i n

e s

e

v

e

a r e

i g h

d

a l s o

- d i m

m

o s t

e n o f

r e l e v a n

s i o n t h

e

a l

l i n

t

t o

t h

( i . e . , e

e

t h

c l u

e r e

p a t t e r n

s

s t e r i n

a r e

u

g

o f

c o r r e s p

s u a l l y o n

d

t o

. 1

p l e t e

t i o e

o i n

,

3 6

c h

a r e

i s

n ≥

e r .

n c o

i n

p

t r a d

( s u

e a s y

i l l

t s

a t a

a d

i s

b e r

i s t

a r e i t

m

a n

a t e l y ,

e y

o s t

u

d

w

a y n

e s e

a t a

e nt i c a

s t e r

6 7

o n e x

f e w

d a t a ,

A

F

i t e m

a c e s ,

d

s

e

t o d

[ 7

n

e e n

s i o n

e n

s e t s

i s

o i n i t

g

i m

n f o

r

f i r s t

j o

h

d i m

n a c c e p

e n

s i o n

t e d .

o f

t h

e

d a t a

s p a c e ,

a n

d

e v

e l o p

e d

a n

d

P

d

c o r r e s p

o n

d

t o

t h

h

i g h

I Q

U

e

log: *. a l

M t

6

r o b l e m

o r d

W

p

l s o ,

v a l u

a t a t e

s t e r i n

- d

o r i c a l

a t a

t

8 . 1 . 1

e t e c t

s i o n

f i l e

6

a l m e n

y

o r

a

c l u

t h

d

a t a

a

f o r t u t h

d

d

a s

2 . 1

e n

o t h e n

p

l o g

a c e s .

r d

p a s t

s i o n

a n u

i m

a l i t y

g

a t u r a l

r i n

d

9

s p a c e

a n

o t

o r i c a l

n

t o

e x n

d

a t t r i b u

A

b e t w

a t a

g

d

c l u

l o w

y

c a t e g

i f f e r e n

d

s e t s

h

a n

c e

f t e n

U

w

2 4, m

d

. 1. 1' ) .

e r e

a l 7

o

s t e r s

e a c h

3 - 4 w

k e y

p a t t e r n

e n

t o

nne c t i o n f r o

R

h

a b l e

s t e r

a v e

c r e a s e s ,

s e v e r e

c e

s t

w

n

'r e d ') .

a t a

c l u

i r d

s i n

j u

s y

a

r ,

1

h

e r

a l

i n

a n

a n

',

t e s . w

o f

l o

8

n i n

s i o n

s i o n

a t a ,

d

e l l

e n

t h

s u

D

s u

d

d

y

i t e

a n

m

. 16

m

u

c o

d i s t a n

i t i o n

q

a i n s i d

t r a d d a t a

d

s e r v i n

2

s

u

c l o s e

i n

l o

[ 7

d

a

a n

f o r m

f r e q

- d i m

t h

l o

i m

b e

s t e r s

l o

d

t o

v e r y

o r e

l i n

a n

o t

f i l e

c l u

s i o n

( 12 ,

T

w

s i o n

e n

c l u

s e c o n

a n

o m

f o r

e r i c a l

's e d

i s

n f r o ' 19

d

c a t e g

a t t r i b u

o f

a s

t h

l o g

m

44) ,

n

e

e r .

s e e n

a r e

',

o f t e n

i g h

a

e s

i r s t l y ,

o r i c a l

o r d

g

u m

c o n

h

c t i o n

w

n

e

e

c a t e g

f u n

o r k

i m

a r e

a l

3

o t

s

t h

f r o m

e n

s t a r t s

o d

c a n

r e

a p

w

d

t s

f o r

F

a n

o i n

t

a l l e n g

p l e ,

e a s u

o f

i m

a m

p

i t e s

e t e c t i o n

e t h o r i g

t e n

o f

d

p u

t h

s ',

'f r o m

q

e r e

t y p e ,

nne c t i o

i s

o t

r o b l e m

m

n

l y ,

a v e

d

e

p

t h

d

) .

o c u

n- t h

o ',

b e r

e r e

t h

C

o f

e c o n

u m

e

e

10

'F

s

t s

t h

e c t i o n

e a s i l y

n

o i n

e x

e l ,

t

r i g h

p

w h

c h

s t e r i n

',

m

c t i o n e

i t h

r e s t S

n t h

a s

s ,

d

o r d

t o

a j o r c l u

w

a n

o

m

e l o

t e s ,

i f f e r e n

f

e d

o r d

b

s

m

( 'F

f u o

A

o w

c e

v i e w w

d

h

o i c e

] .

o

f o r

e l l

r e r ,

d

s

t w

e d

a t t r i b u 6

a n

t h

a r e n

w

l

f a c t u

b v i o u

c h

e s i g

[ 5,

nu

a t a

e r e

d

r i c a

'g r e e n

o

t h

n i s

o

v a l u m

a y ,

i n

a n

I A s e t s d

y e a r s , l i k [ 9

e

C

s e v e r a l L

I Q

U

]

a l g

o r i t h

[ 10

] :

t h

e y

t h

e y

h

a f t e r

E m

a l g ,

s

M

F

I A

c l o s e l y

s t a r t a v e

o r i t h A

i d

m

s ,

h C

a v e A

C

r e m

w

i t h

i d

e n

t i f i e d

i n e n

b T

d

e e n

U

S t h

t i f y i n

c l u

,

s t e r s

e

A

g

p

C

a l l 1,

R

r i o r i

O

f o r L

a l g

c l u . . . , C

C

m

s t e r s i n

U

c l u S

.

o r i t h i n ( k - 1

s t e r i n T

h

m

e

C

g L

f o r 1- d i m ) - d

i m

m

i n

E

i n

g

e n

s i o n

a l

e n

s i o n

a l

sub sp and

ac es,

effec tiv d d

they

e in d

ata sp

to id

ac e.

w

een ind iv id

I nstead

entify

d

e n

s e

nfortunately ,

lik

e

c and

id

d

ata and ass

b uild

d

generate

isc ov er p

method

[ 1

n

1 ,

2

d ata U

E

,

c lusters

w

]

for d

e n

1

M

3

] .

T

s i t y

b

a

s e d ,

and

tw

inv olv es U

w

T

ex

p

d

shap he P

es,

er the d

R

O

w

hic h

CLU

c lusters in sub sp

is

relies on these p rop

3

and

d

ata. then w

t to measure

I n

high er the

ac tual

c lusters.

]

e p

red

if

one

uses the K ac e.

ic ted

H

ow

w

it tend ants -med

oid

ev er,

ac c urately ,

s to

in

and

.

largely t

and ass ov

is therefore fast,

und esirab le

b

ity

uring the sec ond

of

ist for high-d imensional d

the nex

e w

lex

es a p

ates d set

S algorithm [ 8

hat is the right value for K

logfile

id

the

ac es of the original sp

suitable for clustering logfile lines,

ata,

c omp

first mak

ata and

v ious w

of

]

etermines

asses ov

hough sev eral c lustering algorithms ex

the nature

o not attemp

onential

therefore it is not ob

erties of logfile d

hose algorithms are

here a c lustering algorithm tries

S algorithm [ 6

finally

o p

ac es from C1, . . . , Cm ,

T

forms clusters from those regions.

ata the number of clusters can rarely

rop

d

then generates c luster c and and

stretc hed

etec ting K

ec ause they

5

I A algorithms suffer from the fac t that Ap riori-

he CACT

es only

ith

AF

b

ac e,

testing

atterns from logfiles.

[ 4

ac c ount

ata sp ac es,

b ec ause they

sec tion,

w

e

w

ill

d

on' t tak

first

they

e into

d isc uss

the

ill present a fast clustering algorithm that

erties.

Clustering logfile data

3.1 The nature of logfile data

he nature of the d

ata to b e c lustered

algorithm for c lustering. generic

d

the w

ord

generic F

ata are mad

lev el,

e.

H

there are tw

p

lay s a k

ey

role w

hen choosing the right

ost of the clustering algorithms have been d

ow

et b ask

et d ata,

ev er,

hen w

o imp

w

ortant p rop

w

here no sp

e insp

esigned

for

assump

tions ab out the

ec t the c ontent of ty p

ic al logfiles at

erties that d

ec ific

istinguish logfile d

ata from a

d ata set.

irstly ,

most of the w

the results of ata.

M

ata sets suc h as mark

nature of d

d

and and

summary ,

S mak

ac es,

5

hich is often meaningless in a high-dimensional

in the data sp

U

1

w

p roac h is d

s

s a d ata summary ,

the

are not v ery

T

i o

oints,

imensional sub sp

id ates are ac tual c lusters.

the c ase of logfile d

T

p

their ap

generation

Although CACT to

ual p

the CLI Q

ate

using

,

r e g

runtime ov erhead

p

id ates for k-d

hic h of those c and

isc ov ering c lusters in sub sp

istanc e b etw

U

form c luster c and

then c hec k

1

an ex

p

ord

s oc c ur only

a few

times in the d

ata set.

T

eriment for estimating the occurrence times of w

ab le 1 ord

p resents

s in logfile

Table 1. Occurrence times of words in logfile data

Data set

Data set

T

o

si z e

tal

d

#

o

w

o

r d

# f

i f f er en

o

o

t

f

w

c c u o

s

o

r d

n

#

s

r - r i n

o

o g

f

w

o

c c u

c e 2

r d

s

r - r i n

ti m

#

es o

o

o g

f

w

r

3

1

s e r v e r lo

g

f ile

( L

2

ac h

1

0 8

, 1

g

f ile in u

A

u

1

n

g

8

0

0

h

e

o r d

8

4

8

, 0

9

3 3

. 9

1

%

)

, 3

5

( 7

0

9

, 5

. 4

ti m

1

)

, 4

0

( 8

4

2

, 7

. 6

4 8

%

#

es o

o

o

r

f

w

1

)

, 4

4

( 8

3

4

, 1

. 8

1

0

r d

s

r - r i n

ti m

#

g

o

o

es o

r

f

w

1

)

, 4

7

( 8

2

6

, 2

. 6

2

0

r d

r - r i n

ti m

s g

es o

r

l ess

9 6

%

o

c c u

l ess

5 9

%

o

c c u

l ess

l ess

8 1

%

s g

1

, 4

)

9

( 8

3

7

, 1

. 8

6 0

%

)

1

, 8

8

7

, 7

8 0

1

, 0

2

( 5

8 0

4

3

, 0

. 2

2 9

%

1

)

, 2

5

( 6

0

6

, 6

. 3

9 7

%

1

)

, 3

5

( 7

2

9

, 5

. 0

3 5

%

1

)

, 4

5

( 7

7

6

, 4

. 2

8 9

%

1

)

, 5

6

( 8

8

3

, 1

. 1

6 5

%

1

)

, 6

9

( 8

9

5

, 3

. 8

3 6

%

)

. 9 1

MB , 8

4

, 0

1

6

, 0

0 9

3

8 3

, 9

( 9

4 8

8

, 4

. 3

1 4

%

3

)

, 9

( 9

4

9

8

, 7

. 4

7 3

%

3

)

, 9

( 9

5 8

0

, 4

. 4

3 9

%

3

)

, 9

5

( 9

8

1

, 4

. 4

9 2

%

3

)

, 9

( 9

5

3

8

, 6

. 5

9 8

%

3

)

, 9 ( 9

5 8

6

, 8

. 5

5 0

%

)
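The kind of word-occurrence experiment summarized in Table 1 can be approximated with a short Python sketch. It is an illustration under stated assumptions, not the procedure used to produce the table: words are taken to be whitespace-separated tokens, word position is ignored, and the thresholds in `limits` as well as the function name `word_occurrence_profile` are placeholders chosen for this example.

```python
from collections import Counter


def word_occurrence_profile(lines, limits=(1, 5, 10, 20)):
    """Summarize how many different words occur rarely in a set of lines.

    Hypothetical helper: for each threshold in `limits`, report the number
    of different words occurring at most that many times, together with
    their percentage among all different words.
    """
    counts = Counter(word for line in lines for word in line.split())
    distinct = len(counts)
    profile = {
        "total_words": sum(counts.values()),
        "different_words": distinct,
    }
    for limit in limits:
        rare = sum(1 for c in counts.values() if c <= limit)
        share = 100.0 * rare / distinct if distinct else 0.0
        profile[limit] = (rare, share)
    return profile


if __name__ == "__main__":
    sample = ["a b a", "c a"]
    for key, value in word_occurrence_profile(sample).items():
        print(key, value)
```

For a real logfile the lines would be read from disk (for example by iterating over an open file object), and a large share of rare words would be consistent with the observation that most words occur only a few times in the data set.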

e s

l t s

o f e d

w

g

g

,

o f

s p

r i n

W

h

t o g t h

u

r

m

s e s

I n

e

t h

i d

w i s

e ,

g

" C

f

e n

e x

t

s u o f

f o r m

c o n

e

s a m

t y p

l o g

t s ,

w

f i l e

d

h

g

e

a n

f i l e

d

a t a

e n

t l y .

T

u

p

o r t

l o g

a l l w

u r i n

a c c o r d

s

a r e

e

v e r y T

f r e q i l a r

p h

e r i m

e n

e x

p

i s h

i n

i n s i m

g

t h

i s

i s

t o

a

u

a t n

e n

t h

o t

t ,

a n

e n

o m

t

n

d

s i g n o n

e a r l y

e r e

a r e g

f o r m

i f i c a n a s o

y

s i n

f

t h

e

s t r o n

c e

s t r i n

t

b e e n

%

a n

,

a t

h

50

m

s u r p r i s i n

c e r t a i n

a e n

g

b e f o r e

g ,

w

h

e r e

e . g . ,

%

e

a r e s e t .

d

f r e q

t a i n

b s e c t i o n l o g

f

f r o m

c o n

e r e ] .

a t t e d

s t a n

e

i c h

h

r

s

d a t a

[ 14 o

o c c u

o r d

e

w

l y

e r t y

a t

w

t h

a t a ,

e c t i o n

h

f

i n

o n

e r a l l y

n

w

e r t i e s

t h

a r e

t h

d

c e

p r o p s

o

c e

e b o n

t

o n

o

a j o r i t y o n

W r

g

s e t n

e

o r d

s t r i n

e s

e

m s t

o r t a n

e

a t

p r o p

t h

h

e

t h e r e b s e t

e

t o

d

e n

d

a t a

o i n

t

d

s i t y

s p

w

a c e .

n i s

t h

o f

e

a c e

m d

a t a

e

g

p

d

e d

o f

i l l

%

t h

" ,

m

i p

a n

e

y

c o n

r e s e n

t

a

a d

d r e s s ,

t i m

s t a n

c l u

p

e s , t

o r t n

t h

e r e

p a r t s

o f

s t e r i n

g

a l g

u m

w

b e r ) ;

i l l

a l s o

b e

e

f o r m

a t

t h

r e l i e s

t h

o r i t h

m

a t

m

a n

y

s t r i n

g

o n

a t a .

l d

o r i t h

m l i n

e d

u m

s p a c e ,

n w

f o r t o

t o

u h

m e r e

h

i c h

c l u f o r m

c o n a

o n b e r

d

s p

t a i n

d f i l e

o f

c e r t a i n

l o g w

t h

s t e r i n a

g

o u

o r d

l d

b e

s t e r s

o n

l o g i n

w

c l u

r e l i e s

f r o m

c o r r e s p

w

e t e c t

m

e r e d

e

m d

p r o a c h s i d

a e

o r i t h o u

a l g

a s s u t s

i m

w

a p

c o n

t h a x

a l g

i c h h

i s

f r o m e

T

a r e

r e p r e s e n s

a n

h

b a s e d

s t e r s

o r d

t h

e s i g n

a n

s p

c l u

w

d

a t a ,

d a t a

e

p

a s d

a l

e a c h

h

p

d a t a

a j u

o c c u

e e n

e s s a g e

w

t h

e t e c t e d

a r e

i m

e s s a g

t h

a i m

i n

T

d

a t

e a r W

t o

e s s a g

t f ( m

p

3.2 The clustering algorithm

v e r

o r i g

d

f o r m

e c i a l

O

m

e r .

s p

n

b e t w

e

i n

e t h e

a t h

e n

e s

s

t h

a p

o r l d

f o u

s e c o n

i n

o w s

W

e r e

e

a r t s

l i n

s h

o r d

f o r

w

h

l o g

w

4 0

( 4

MB , 7

3 9

r e s u

s

T

s u

9

4 , 8

c o r r e l a t i o n

u

, 8

r d

)

b s e r v

d

0

5 r

o

r - r i n

in 0

w

o

0

w

e s

lin

T

p

. 9

8

0

4

f r a c t i o n o

, 7

4 8

f

f ile

( W 2

8

lin

s e r v e r lo

1

MB

, 1

es o

o

c c u

x )

t h e n t ic

at io

7

# o

e s

e

( L

. 3

5

s g

x )

s e r v e r lo

5

, 6 lin

in u

C

0

7

r d

r - r i n

ti m

l ess Mail-

o

c c u

t h

e

s p

g .

P

d

p

f i l e s

o i n

a t a

p

a t t r i b u

t s

e .

e r

l i n i

T T

m

a k

e

t

i n

a t

d o

e r t i e s

o f

n

o n

o f

o t

c a t e g o r i c a l

d a t a

i n . . . , i

t h k

l o g

b e l o n

a t t r i b u

e

a

b s p

g

f e w

p

a s s e s

a c e s

f i l e

d

t o

a n

o f

t h

a t a ,

a n

y

o

f

e d

t h

e

outliers.

h

e

l y

s u

i t h

h e

1,

t h

w

d

p r e s e n p r o p

s t e r

s e t .

l i n

t e s

t s

c l u

a n

a r e

e c i a l

o i n

e c i a l

a t a

f a s t

a t

e

( 1 ≤

d

s p

t e s a c e

a t a k

o



a t t r i b u f

h

t e s ,

e a c h a s

s e t .

A

n)

o f

w

d

a t a

n d i m

e n

reg

ion S

a l l

p

o i n

h p

e r e o i n

s i o n

t s

t s ,

i s

a

t h

a t

b elong to S hav e id { ( i1, v fix

1)

ed

, . . . , ( ik , v

) } k

attrib ute) ,

at least N T

p

it first mak

another

p

ass

b efore.

to

d

ense 1

mining) .

ord

ense 1

all c luster c and tab le w

c luster ,

ord

is c onsid

p

hic h is initially

freq

sup

-regions ( freq

id ates d

ty .

T

b een d

isc ov

not p resent in the c and id ,

otherw

ise its sup

p

to the c luster c and id line b elongs to m c luster c and ex w

amp

le,

ate.

T

ith the fix

ed ,

,

ith sup p

regions that are guaranteed B p

ec ause

of

the

attern,

e. g. ,

d

' authentic ation' ) ,

( 3 ,

of

c luster

w

' for' ) ,

( 5 ,

ay

b

y

j ust p

6

T

S

then

information id ates.

the algorithm id u

entifies

e n

t

w

o

r d

s

uring the

times in the d

entified

,

ata set,

the algorithm b uild

line b

y

line,

.

t in the c and id and

1)

ed

s hav e

there ex ense 1

id ate is

ort v alue

ing w

, . . . , ( im , v

attrib utes { ( i1, v

another d

ord

the line is assigned

in the follow

and

s

ate

hen a line is

ith a sup p

I n b oth c ases,

. 1 ,

w

uent w

I f the c luster c and

attrib utes ( i1, v

. 1

ep

one or more freq

ed

8

the c and

ense)

are rep

the

' ac c ep

eac h

1)

if the

then the ) } . m

ense 1

-region w ,

ay : ) ,

m

, . . . , ( im , v

ist a d

attrib utes { ( 1

id

F

or

-region

ith the fix

ed

' Connec tion' ) ,

ate tab le is insp

set

ted

') }

of

he CLI Q

U

E

orted

fix

ed

c orresp T

b

c luster

w

ec ted

p ort threshold

y

( 2 ,

ond

s

{ ( 1

to ,

s to the line p

a

'P

and

all

c ertain line

assw

attern

the algorithm c an rep ithout rep

algorithm rep

,

v alue ( i. e. ,

the algorithm as c lusters.

c orresp

attrib utes

ond

hus,

rinting out line p atterns,

b elong to eac h c luster. [ 7

ed

and

region,

ith

U

and

ate.

e d

a

summary

into the tab le w .

ith the set of fix

authentication for * accepted. a concise w

the

ual or greater than the sup

to b

efinition

the

. 1

of the algorithm,

ort v alues eq

the user.

en into ac c ount d

id ate is formed

' Connec tion' )

b ec omes the c luster c and id

regions w

roc essed

ith the set of fix

then a region w

During the final step

y

s a d ata summary ,

id ate is formed

2

e c all the set

there is j ust one

from the set of c and

hav e b een id

1 9

W

7

is a region that contains

he cluster candidates are k

m

. k

hat similar to the CACT

using

ill b e inserted

f r o

v

( i. e. ,

5

v alue.

ata set is p

it w

=

ik

uiv alent to the mining of f r e q

ill b e inc remented

n e c t i o n

attrib ute ( 1

' from' ) ,

s)

T

he c luster c and

n

i o n

ata summariz ation) ,

a c luster c and

id ate is a region w

x 1

giv en b

b uild

id ates,

-regions that hav e fix

o

e

ata and

is eq

ord

ass.

ate tab le,

ense 1

r e g

l u

ense 1 -regions ( i. e. ,

ort v alue w

if the line is C

attrib ute ( 2 ' from' ) }

d

c and

s e

v a

. . . ,

I f k =

uent if it occurs at least N

he d

to b elong to one or more d on the line) ,

1,

is somew

ort threshold

uent w

found

ered

p

uring one p emp

v

osition in the line is tak

ered

ec ified

o l d

c lusters are selec ted

ote that this task

=

i1

A d e n

er the d

of the algorithm ( d

N

is the user-sp

After d

b uild

ata set ( the w

A w

here N

o

es a p ass ov

As a final step

-regions.

p

n .

and

from the d

1

- r e g i o

s,

During the first step

w

1 p

x

of region S.

he algorithm consists of three step

es

is the s u

∈ S,

∀x t e s

r e s h



here N

: k

u

t h

]

w

. . . , v

t t r i b

r t

c ollec ted

all

1,

a

the region is c alled

oints,

algorithm [ 6 mak

entic al v alues v

the set of f i x e d

1

orting ind

ord

') ,

( 2 ,

Password

ort c lusters in

iv id

ual lines that

orts clusters in a similar manner

] . T

he first step

for mining freq itemsets.

T

of the algorithm remind uent itemsets [ 1

hen,

how

ev er,

0

] ,

s v ery

sinc e freq

our algorithm tak

all c luster c and

id ates at onc e.

algorithm is ex

ensiv e in terms of runtime [ 1

and

testing inv olv es ex

logfile d mak

p

es

p

T

little

onential c omp

sense

to

test

c omb inations that are generated c omb inations are p

ord

the p

op

ular Ap riori algorithm

s c an b e v iew

es a rather d

ed

ifferent ap

b

lex

ity .

1

2 ,

1

Sec ond

3 ] ,

sinc e the c and

ly ,

Ap

p

otentially riori,

resent in the d ata set.

w

huge

hile only

een freq

numb er

of

a relativ ely

uent 1

-

generating

F

irstly ,

id

ate generation

sinc e one of the p rop

strong c orrelations b etw a

y

1 ,

as freq

p roac h,

here are several reasons for that.

ata is that there are many very

c losely

uent w

uent w freq

Ap

riori

erties of ord

uent

s, w

it

ord

small numb er of

It is much more reasonable to id

entify

the

e x

R

i s t i n

w

h

m

N

o t e s

u

v e r y s t i l l

c o m

b i n

t h

a n

e l y

a n

g

d

u r i n

o n

d i d

d c e

n

o t r a t h

f i t

t o

t h

i n

o r t a n

t

i s s u

a

o

u

f

e ,

m

a i n d

y

a s s

s t r o n

e r

s u m

p

o v

t h

e

e r

t h

e

d

c o r r e l a t i o n o n

a n

t h

o r t

t h

o r y .

H

n

g

a t

t h

p

e m

i n

p

e r a t e d

l a r g

l o w

a n

a n e n

c h

g l e

a t a ,

a n

d

v e r i f y

a f t e r

t h

e

p

a s s

s t e r s .

m g

e r e

s i n

c l u

a r e

m

a

g

t o

a t e s

i v e n

t o p

s

p r e s e n

o f t e n

a s

i m

e c a n

i s

l i k

a t i o n

t h i f

d h

d i

c o r r e s p

a t

a t

s e r

a a r a n

e m

t h t h

e ,

e

V

o f

e a n l a r g

t h

d

g

i c h

i s t o

e x

t

c e ,

e

n

r e s h o

s

e i r

u m o l d

w

s u

t h

b e t w

n

u m

b e r

o f

v a l u

e ) .

e v e r ,

t h

b s e c t i o n

w

e

f r e q T

m

h

w

i s u

f r e q

u

o t

l i k

n

e n

t

w

o r y

i l l

d

t h

c o s t

i s c u

e n

o

s s

s

w

o r d t o

c a n t h

d

e

i s

s

b

m

a l s o

e

i t s e l f

e f

t h

t

e l y

o r d

e r e f o r e ,

e m

e

e e n

b e r

v e r y

( u n

i d

l e s s

a t e s

a l g

o r i t h

a t t e r

i n

a r e m

i s

m

o r e

e t a i l .
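The three steps of the algorithm described in this section (finding frequent words together with their positions, building cluster candidates in a second pass, and selecting candidates that reach the support threshold N) can be sketched in Python as follows. This is a minimal in-memory illustration, not the SLCT implementation: all function names are hypothetical, words are assumed to be whitespace-separated, and a reported pattern simply ends at its last fixed word.

```python
from collections import Counter


def frequent_words(lines, support):
    """Step 1: find (position, word) pairs that occur at least `support` times."""
    counts = Counter()
    for line in lines:
        for pos, word in enumerate(line.split(), start=1):
            counts[(pos, word)] += 1
    return {pair for pair, count in counts.items() if count >= support}


def cluster_candidates(lines, frequent):
    """Step 2: map each line to the set of frequent words it contains."""
    candidates = Counter()
    for line in lines:
        fixed = tuple(
            (pos, word)
            for pos, word in enumerate(line.split(), start=1)
            if (pos, word) in frequent
        )
        if fixed:  # lines without any frequent word form no candidate
            candidates[fixed] += 1
    return candidates


def to_pattern(fixed):
    """Render fixed attributes as a line pattern, with * for variable words."""
    by_pos = dict(fixed)
    return " ".join(by_pos.get(pos, "*") for pos in range(1, max(by_pos) + 1))


def find_clusters(lines, support):
    """Step 3: keep the candidates whose support reaches the threshold."""
    lines = list(lines)
    candidates = cluster_candidates(lines, frequent_words(lines, support))
    return {to_pattern(f): n for f, n in candidates.items() if n >= support}


if __name__ == "__main__":
    log = [
        "Password authentication for john accepted",
        "Password authentication for mary accepted",
        "Password authentication for alice accepted",
        "Connection from 192.168.1.1",
        "Connection from 192.168.1.7",
        "Disk failure",
    ]
    for pattern, n in find_clusters(log, support=2).items():
        print(n, pattern)
```

Lines that contain no frequent word, or whose candidate does not reach the threshold, are left out of the result and correspond to the outliers of the data set. The sketch holds all counters in ordinary dictionaries and therefore does not address memory cost.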

3.3 The issue of memory cost

A

l t h

t h

e r e

a l g

o u

h

e

m

m

s i z e s

T

h

e c k

d a n

r e s u l a r i e s

m

r o w

a i n

s ,

m

e m

e

h

e t h

d

a t a

e r

i n

t h

e

a t a

e t o

s h

o w

l d

o c c u

y m

o s t D b

y

a n

i t s

v o

c a b u

t

s t r u

a t

d

e r i m

u

n

s

e r

t h

s e



u n

g

e

w

f o r m

r r e n

d a t a

ac h A

u

o r y

v

o

c a b

u

l a r y

t h

e

a l g o r i t h

0

O

0

0

n

w

o r d

i s

a

e

o t h

a r e

I n

o r d

o c a b u

o f

c t i o n

a t e

e r

v e r y

a s t e

s t r u

e s t i m

e r e f o r e

f a s t ,

s t a n

t h

m

c e s

e

r t h

m

e r ,

w

t o

o r d

i t

p

n

l e m

g

d

o

t h

f e

v o

e

i n

t e d

g

m

i l l

w

a s o

1G

B

e m

o r y .

s d

c a b u

c a b u

b e

i n

c o n

- m

e m o v ] ) .

e n

s u

m

t h

l d

n

o t

s

t h

a n

e

d

i f

1.

I f

t e d . a o

i n

s i z e

,

t o

l o t

c a b u

e - t o - f r o n

e

c o u

o r d

e v

s t e p s e e k

s e t

s e t s ,

s

m

l a r y ) ,

t e r

o r y

m

f i r s t

w

c r e m

[ 14 a t a

l a r y

n

t o

A

e

o r i t h

e a c h

c o u

a

r d

o r

v o

c e

t h

a l g

F

( o r

e l y

t h

e n

- s i z e

w

l i k

i s

e

s .

r r e n

t e r

m

t h

o r d

t a b l e

i s

l a t i n

y t e s

w

o c c u

c o u

u

i u m

a n

i n

e a s u r i n i m

a b

e

,

t

- m

h

t h

f i t

i n

f

a s h

e m

o f

o

l a r y

o r y

e

d

a t a

t o

t h

e

i z e

T

o

tal

#

o

x )

1 x

)

0

1

2

0 1

5

8 0

8 4

3

. 3

MB

, 7

, 6

. 9

MB

, 8

, 1

. 9

MB

, 4

, 8

5

7

8

9 9

1

, 1

4 8

lines

, 7

8 0

lines

, 8

8 3

lines

f o

d

i f f er en

r d

t

T

h

s 1 1

, 7

0

, 8 4

e si z e o v o

0

, 0

8

7 1

6

c ab

u

f

th

e

l ar y

, 8

4 0

, 7

8 0

1

5

9 3

8

MB

MB

, 0

0 9

2

1

4

MB

)

t h s

w

c o n

v

logfile (Linu

e server logfile (Linu

in2

t h

c i r c u

s i z e s

S

t h ent ic at ion server logfile

(W

i s

c e r t a i n

w

C

d

e r

o r e .

Data set

Mailserver

a n

d

a r i z a t i o n

i t s

c e

a s

e g

m

s e t ,

a c c u e d

m

e

i t h

m

o f

m l i n

t h w

f o r

l a r y

f u

i n

a t a

t

a r t s u

e a c h t

d

p

a t a

l a r y

e n

o f

e t e r i o r a t e s

d

o c c u

f o r

d r e d

o v u

s i v e e

e

c t u r e

e v e n h

e n

l i t t i n

l a r g

( e a c h

y

i t s

p r e s e n

l a r y , a

e r

t h

c a b u

p

d a t a

p

g

s p

v o

f o r

a s s e s

d

e x

i s

e

p

i n

u r i n

o r d

c a b u

o h

o r y .

e x

t h p

m

t h

i l t

t s

s e t s

e

w

t w t

e m

s e t ,

i n

b u

e f f i c i e n

m

i l t .

t h

v o

i s

s t i g h

o f

t h

b u

s e r t e d

j u m

l o t

i s

p r e s e n

a n

e m

a

e

s i t u a t i o n

o r y

Table 2. In- m

t h

t

e s

i c h

c o s t ,

i n

l a r y

c o u

a k h

e

a r y

w

l t s

t h

m

m

i n

2

i s

e

g

s

b e

r e e

o c a b u

s e t

s

a b l e

i c h

m w

o r y

m

p r e s e n

T

m

s u

e m

o c a b u

t h

h

m

i l l

v

f o r w

c o n

o r d

i s e

l d

w

c h

o r i t h

p r o b l e m

s u

w

o r y .

t a b l e

e

a t a t

i t

t h

e m

v

e n

o r d

I f

a l g

o f

d

u

't ,

w

s

e

o r i t h i s n

r o n

c o u

t h

f r e q

a l g i t t h

m

t e r m

e n

f o r

o u

s t i l l

o r i t h I n

w

g h

i s

w

l a r y

s p w

e r

h

t o h

a n

i t h

d ,

f r e q

a c e . i c h

o u

n

w

t

T

t h

h

s

w

i s

f i n

t

i t

a l l y

e r t i e s

w

not o r d

i s b

t o s

o

s t o r i n

i n , b

i n

g

i m

e

p r o b l e m

need

i r r e l e v a n

p r o p

a t e l y ,

i l l t h

e

e r e f o r e ,

f o r t u n

i t h s

o f

t .

o r d

o r d

e

e n

U

e

w

o n u

w

c o p

i c h w

h i n

p

w

i t .

e

l o g

f i l e

o s e

u

u

e n s e

t o

a t a

p

i n

i s

t h

f r e q

r e d

u

i c t

a t

d

e n

a

t

m

w

u r i n

a j o r i t y

o r d g

t h

s e

o f

t h

t o

m

e m

o r y

e

v o

c a b u

l a r y

t . t h

e

s t o r e d

i n

B

t h

e f o r e

d

v e r y

o s s i b l e

f r e q

e

f t h

f o l l o

e

m

e m d a t a

w

i n

g

o r y , p

a s s

t e c h a n i s

d m

n

i q u t h

e n

a d

e

e

-

w

e

f i r s t

c r e a t e f o r

b u

i l d

t h i n

e g

Before building the vocabulary, the algorithm makes an extra pass over the data and builds a word summary vector. The word summary vector is made up of m counters (numbered from 0 to m-1), with each counter initialized to zero. During the pass over the data, a fast string hashing function is applied to each word. The function returns integer values from 0 to m-1, and each time the value i is calculated for a word, the i-th counter in the vector will be incremented. Since efficient string hashing functions are uniform [15], i.e., the probability of an arbitrary string hashing to a given value i is 1/m, each counter in the vector will correspond to roughly W/m words, where W is the number of different words in the data set. If words w1, ..., wk are all the words that hash to the value i, and the words w1, ..., wk occur t1, ..., tk times, respectively, then the value of the i-th counter in the vector equals the sum t1+...+tk.

After the summary vector has been constructed, the algorithm starts building the vocabulary, but only those words will be inserted into the vocabulary for which their counter values are equal to or greater than the support threshold value given by the user. Words that do not fulfill this criterion can't be frequent, because their occurrence times are guaranteed to be below the support threshold.

Given that a majority of the words are very infrequent, this simple technique is quite powerful. If the vector is large enough, a majority of the counters in the vector will have very infrequent words associated with them, and therefore most of the counter values will never cross the support threshold (unless a very low threshold has been specified).

Table 3 presents an experiment for measuring the effectiveness of the word summary vector technique for three data sets (each counter in the vector consumed 4 bytes of memory). The experiments suggest that the employment of the word summary vector dramatically reduces vocabulary sizes, and large amounts of memory will be saved (during the experiment, the vocabulary sizes decreased 25-100 times). On the other hand, the memory requirements for storing the vector itself are relatively small. For example, the largest vector we used during the experiments occupied less than 400 KB of memory.
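The two passes described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function names are hypothetical, a word is taken to be a plain whitespace-separated token, CRC32 stands in for the unspecified fast string hashing function, and the data is held in a list only to keep the sketch short.

```python
import zlib
from collections import Counter


def word_hash(word, m):
    """Map a word to a counter index in [0, m-1] (CRC32 is a stand-in hash)."""
    return zlib.crc32(word.encode("utf-8")) % m


def build_summary_vector(lines, m):
    """Extra pass: each word occurrence increments the counter it hashes to."""
    vector = [0] * m
    for line in lines:
        for word in line.split():
            vector[word_hash(word, m)] += 1
    return vector


def build_vocabulary(lines, vector, support):
    """Vocabulary pass: count only words whose counter reached the threshold."""
    m = len(vector)
    vocabulary = Counter()
    for line in lines:
        for word in line.split():
            if vector[word_hash(word, m)] >= support:
                vocabulary[word] += 1
    return vocabulary


def frequent_words(lines, m, support):
    """Return the words that occur at least `support` times, with their counts."""
    lines = list(lines)  # assumption: kept in memory here; a real tool would re-read the file
    vector = build_summary_vector(lines, m)
    vocabulary = build_vocabulary(lines, vector, support)
    return {word: count for word, count in vocabulary.items() if count >= support}
```

A counter is an upper bound on the occurrence count of every word that hashes to it, so skipping words whose counter is below the threshold can never discard a frequent word. Hash collisions may let some infrequent words into the vocabulary, which is why the final filter on the exact counts is still needed.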

Table 3. The effectiveness of the summary vector technique

Data set                                 Support threshold  Vector size  Total # of different words  # of words in the vocabulary  Reduction factor
Mailserver logfile (Linux)               1% (76,571)        5,000        1,700,840                   40,935                        41.54
Cache server logfile (Linux)             1% (81,897)        5,000        1,887,780                   18,998                        99.36
Authentication server logfile (Win2000)  1% (48,918)        5,000        4,016,009                   118,208                       33.97
Mailserver logfile (Linux)               0.1% (7,657)       20,000       1,700,840                   61,244                        27.77
Cache server logfile (Linux)             0.1% (8,189)       20,000       1,887,780                   50,246                        37.57
Authentication server logfile (Win2000)  0.1% (4,891)       20,000       4,016,009                   73,376                        54.73
Mailserver logfile (Linux)               0.01% (765)        100,000      1,700,840                   66,849                        25.44
Cache server logfile (Linux)             0.01% (818)        100,000      1,887,780                   69,210                        27.27
Authentication server logfile (Win2000)  0.01% (489)        100,000      4,016,009                   128,922                       31.15
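The figures in Table 3 can be cross-checked with simple arithmetic. The sketch below uses hypothetical helper names and assumes that 1 KB means 1024 bytes. It computes the memory taken by a vector of 4-byte counters and the reduction factor, taken here to be the total number of different words divided by the number of words in the vocabulary.

```python
COUNTER_BYTES = 4  # each counter consumed 4 bytes of memory, as stated in the text


def vector_memory_kb(counters):
    """Memory needed by a summary vector, in KB (assuming 1 KB = 1024 bytes)."""
    return counters * COUNTER_BYTES / 1024


def reduction_factor(different_words, vocabulary_words):
    """How many times smaller the vocabulary is than the full set of different words."""
    return different_words / vocabulary_words


# A vector of 100,000 counters, the largest size in Table 3.
largest_vector_kb = vector_memory_kb(100_000)

# First row of Table 3: 1,700,840 different words, 40,935 words in the vocabulary.
first_row_factor = reduction_factor(1_700_840, 40_935)
```

Under these assumptions a 100,000-counter vector stays below 400 KB, and the first row gives a factor of roughly 41.5.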

If the user has specified a rather low support threshold value, there could be a large number of cluster candidates with very low support values, and the candidate table

160

c o u

R

l d

c o n

v e c t o r i s

i s t o

b u

s u

t e c h

i l t ,

m

a a r a n

e

n

t h

V

a

i q

e

u

s i g n

e

a l g

d i

i f i c a n

c a n

a l s o

o r i t h

o u

m

m

a k

i c h

i s

l a t e r

e s

p

a n

t

e x

o f

m

t o

c l u

t r a

p a s s

d i d

a t e

t a b l e .

4

Simple Logfile Clustering Tool

I n

o r d

d

e v

L

e l o p

i n

u S

L

c a n

e d

x ,

h

T h

s p

a s h

h L

C

m

e

h

t h

x

a

d

a n

o p

a s h

i n

i f

a

f u

n

t a b l e

s ,

o n

i n

c o r r e s p

p

o n

h

t

h

a s h

a s

a

.

u t

d a t a ,

i t

c l u

p

w

c o n

a r e

n

d

e r

i l d

i n

a

s u p

t o

e . g

a v e

m

o r t

c l u

t h d

e

s e d

,

S

R

C

a s

e d

h

a n

u

s e s

m

. 0

n

t h

c y

e v

o

t i m e

o t

e n

[ 14

] .

f

e

t h

e s

c a n

f a s t o n

e

o v e -

s l o t

t h

i s

s

8

d m

e s ,

t a b l e

a c c e s s

T

e

b e e n a t

a t

e f f i c i e n

o r i t h

t h

r e v i o u

h

l a r y

t i m

a t a

t o

s .

a c c e s s

e

v e c t o r

p

t h

a s h

t h

L

a l g

e

o n

a r y t a b l e

i n

o o l )

s t r a t e d

h

d

T

m

a r y

t h

g

o c a b u

a t a

e

i s , i s

v

d

t h

t h h

a n

l y

u

d

s e d

e c t o r s .

r e s h i n

w

m

m

a t e

s e r t e d

s e d

o n

f o r

s t e r s

i n

s u i d

l a t f o r m

v

e m

c e

o r t a n

a r y t h

g

e d

s u m

i n

p

e a c h

T

a

u

I X

t o

a v o i d

s u m p

t i n

h

N

l o

] .

s

t h

c a n

a t e s

a r i l y U

e n

u

i s , e

s t e r i n

v e r y

p

[ 15

l u

e c t e d

i s

m

g

o r t s

s t e r s ,

i m

c t i o n

o r i t h

r e p

n

i l d

d i d

C

e r n

l e m

t h t h

d e s c r i b e d

p r i m

o d

i t h

b u

f i l e

l a r i e s

s

o r d

b u

a n

i m

m

d c a n

m

o g

b e e n

o s t

c t u r e

f u

a l g

f o r

a s

o c a b u

g

g

h m

c r i t i c a l

I n

i n

f i l e s

t o

o r d

i n

l o g

d

v

w

a s h

a l s o

d

L

a v o i d b e f o r e

o f

o r i t h

l e

t o –

a n

b e r

a l g p

e r

d a t a

u m

g

f o r

e

i d a t e s

e

n

i m

o n

s t r u

y

a r g i n

g

b u o f

h

m

s t r i n

l i s t

a n

e r

a n

o r d

d

t h

e

( S

C

l a r g

c t i o n

w

T

o r k

d a t a

m

e r

t h

t a b l e s

i t h t

C

w

a s h

d

s l o

e r a t i o n

a t

h

d

c e

s t e r i n

L

i n

a n

w

a n

g

o r

g

t

u

c l u S

r i t t e n i l e

t s

l l

a c c e p

s t e r i n t h

w

e f f i c i e n f u

a

f i l e

c a l l e d

p

e n

i s

d - X

i v e n

s

l o g

c o m

e r i m

s e

u n

c l u

p a t t e r n

p

h

a n

g

l d

i s

e

i f t - A

i s

e

t o o l

o v e - t o - f r o n

t a b l e

t a b l e

t h

b e e n

o u

E

b e c a u

y

t t a l

a s

s h

t a b l e

f

h

T

e t e c t e d

l i n

s e s

o

S

e n

T

a s h

b t

a s h

S

C i t

t a b l e ,

c r e a s e

f o r

d

h

e e d

e f f i c i e n

L

g h

a s h

e

e n

e r i m

t a b l e .

h

t h

e

i n

u

t

e n

h

T

p l e m p

S

o u

a t e

t o - f r o n w

.

t h

C

d i d

i m e x

r e d

I n

c a n

o v

c a n

t o

t o

o r y .

s t e r

a t e s ,

a n

s e d

e m

d i d

,

u

n

l i e d

c a n

e r

h

a m a p

f o r

s e c t i o n

w

t

b e

o l d a

a s

c o n

i n

p

c i s e

u w

t ,

a n

a y

b

d

a f t e r

y

p

r i n

i t

t i n

h

g

a s o u

t

. ,

Dec 18 * myhost.mydomain * connect from
Support: 570

Dec 18 * myhost.mydomain * log: Connection from * port
Support: 570

Dec 18 * myhost.mydomain * log:
Support: 679

T c a n F
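Output in the form shown above is easy to post-process. The following sketch is not part of SLCT, and the function name is hypothetical. It assumes that each cluster is printed as one pattern line followed by one "Support: N" line, and it collects the clusters as (pattern, support) pairs.

```python
def parse_clusters(report):
    """Parse pattern / 'Support: N' line pairs into (pattern, support) tuples."""
    clusters = []
    pattern = None
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate clusters
        if line.startswith("Support:"):
            if pattern is not None:
                clusters.append((pattern, int(line.split(":", 1)[1])))
                pattern = None
        else:
            pattern = line  # remember the pattern until its support line arrives
    return clusters


SAMPLE_REPORT = """\
Dec 18 * myhost.mydomain * connect from
Support: 570

Dec 18 * myhost.mydomain * log: Connection from * port
Support: 570

Dec 18 * myhost.mydomain * log:
Support: 679
"""

clusters = parse_clusters(SAMPLE_REPORT)
```

The resulting list can then be sorted by support or filtered by the number of wildcards, depending on which clusters are of interest.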

h

o r

t h

t h

e

t h

s p

o r e

c l o s e l y ,

d

e

u

o

f

e o u

a l

o r i t h

a l g

o r i t h

s u m

s m

a n

t h

p

s

l i k

a n

o r t

r e q

d

u i r e e

l i n

d

m

C

L

t h I Q

T

1,

c h

t h

U

g w

E

w

h

f l a g

s

t h

w

h

p

a t t e r n

c e

a l l

o f

t h

e n

e r a l b

a t

o i n

i c h

t d

e t h

o

m n

u

s t o t

m d

t h b e f o l l o

i n t h

e r

a m

p

t h

o r e

s p

e c i f i c

C

c a n

1,

. . . , C

C

1,

a l w

o l d .

a r t

o f t h

i s

e

o n

p

t o

l t h e

r e q

c l u

r e p o u

h

e n

e a c h d

p

a r e d

t o

a l s o o r e

t r a d o n

l y ,

t h

m

n

t ,

i n

o r d

d

s u

s i d

w

h

e r

a r e t o

t h

p

p

t o

c l u e n

e

o r t

e r e d

e

c l u

i s

a t c h

f o r

e

a l

e r e

t a b l e

a l s o

o n

e v e n i t i o n

e

a t t e r n

t h

a n

s t e r

t a b l e .

p

c o n

t h ,

t h

d

f o u

e d

c l u

a t e

i n

a t t e r n

s

a d

i d

s e c o n

o r t e d

g

s t e r

u i r e m

d

m

e c t c a n

i d a t e s e

a r e k

g

d

a t t e r n a r e

k

s p e

t h

s e c o n

a y s A

c a n l e ,

. . . , C

b e l o n

a r e

C

o t h

a t c h

s

w

t o i n

m

r e s h p

L

e x

e

a t t e r n

T

s t e r s

e

d i d a t e s

l i n

S

c l u

a r e

a b o v

i d a t e s

c a n

e

f o r

e r e

e a t

g

a

p

w

t h

t i n

t o

f o r c e s

t h

t h

e s

a y ,

l i n

e l o

e r

I n

c a n

g

a t

s e a r c h

e t h

e

w

t h

e

s . l i n

t h

b e l o n

I n

p

e

r e p r e s e n k

e r e

e v e r y

l i n

e c k

e s

a t

C.

d

s t a r t s

e

s i n

. . . , C

o r e

a t

,

v a l u

e s

e s

a n i t

l i n

i r d

o r t

i d a t e

v a l u

C

C

p

a l l

c a n

s l y , p

p

L

t h

a t e s

s u

d

e

e

m

e f o r e

e c i f i c

t h

d i d e

c o m b

S

s p

a n

t h

a

C,

o r e

c a n

C,

t o

a l g

m t h

C,

g l t a n

i n

I f

a t e

e c i f y

i d a t e

t

e c i f i c

i r d .

d i d

v a l u

o r i g

c a n

m c a n

s p

b e l o n s i m

s e r

r e p r e s e n

o r e

c a n

u a t e

e a c h

a t

m

e

d i d

s t e r t h

e i r

s t e r i n

g

s e v e r a l a c h

i e v e

clustering results that are more comprehensible to the end user [7].

By default, SLCT does not report the points that do not belong to any of the detected clusters. Though outlier points are considered to form a special cluster, there is no concise description for this cluster, since it does not correspond to any pattern. As SLCT processes the data set, each detected outlier point could possibly be stored in memory, but this could be way too expensive in terms of memory cost, especially when there are many outliers. Therefore, SLCT does not discover outliers by default. If the end user has specified a certain command line flag, SLCT makes another pass over the data after clusters have been detected, and writes all outlier points to a file. Note that when a relatively high support threshold value has been specified by the end user (which tends to create few clusters and many outlier points), one can apply SLCT to the file of outliers (and repeat this process iteratively for every new outlier file).

We have made many experiments with SLCT, and it has proved to be a useful tool for building logfile models and detecting interesting patterns from logfiles. Table 4 presents the results of some of our experiments for measuring the runtime and memory consumption of SLCT. The experiments were conducted on a 1.5 GHz Pentium4 workstation with 256 MB of memory and Redhat 8.0 Linux as the operating system. For all data clustering tasks, a word summary vector of size 5,000 was used. Since SLCT was also instructed to identify outlier points, four passes over the data were made altogether.
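The extra pass that writes outlier points to a file can be illustrated as follows. This is a sketch under assumptions rather than SLCT's actual code: the function names are hypothetical, a `*` in a cluster description is assumed to stand for exactly one whitespace-separated word, and words after the last position named in the description are assumed to be unconstrained.

```python
import re


def compile_pattern(pattern):
    """Turn a cluster description into a regex; '*' stands for any single word."""
    parts = [r"\S+" if word == "*" else re.escape(word) for word in pattern.split()]
    # Assumption: anything after the last word of the description is allowed.
    return re.compile(r"\s*" + r"\s+".join(parts) + r"(?:\s.*)?$")


def find_outliers(lines, patterns):
    """Yield the lines that match none of the cluster descriptions."""
    compiled = [compile_pattern(p) for p in patterns]
    for line in lines:
        if not any(rx.match(line) for rx in compiled):
            yield line


def write_outliers(lines, patterns, out):
    """Write each outlier line to the open text stream `out`; return the count."""
    count = 0
    for line in find_outliers(lines, patterns):
        out.write(line.rstrip("\n") + "\n")
        count += 1
    return count
```

Because the outliers are streamed straight to the output, memory use does not grow with their number. The resulting file can be clustered again with a lower support threshold, which is the iterative use the text describes.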

Table 4. Runtime and memory consumption of SLCT

Data set

S

u

th

p

p

o

r esh

r t o

l d

# d

c l u Mailserver

logfile (Linux)

1

0

Mailserver

logfile (Linux)

5

%

Mailserver

logfile (Linux)

1

%

Mailserver

logfile (Linux)

0

ac h

e server logfile (Linux)

1

0

C

ac h

e server logfile (Linux)

5

%

C

ac h

e server logfile (Linux)

1

%

C

ac h

e server logfile (Linux)

0

A

ut h ent ic at ion server logfile

. 5

C

(W

in2

A

0

0

0

A

in2

0

0

A

0

in2

0

T

0

0

0

1

4

(3

8

2

, 8

5

7

)

(7

6

, 5

7

1

)

(3

0

5

%

1

%

0

. 5

0

0

-2

8

5

8

, 9

7

8

(4

0

9

, 4

8

9

)

(8

1

, 8

9

7

)

%

(4

%

0

, 9

4

1

)

1

8

9

, 1

8

8

(2

4

4

, 5

9

4

)

(4

8

, 9

1

8

)

1

)

0

%

8

9

tl i er

M

ts

1 3

8 1

8

4

8

c o

1

2

, 1

6 6

, 1

6 6

4

, 3 1

em n

2 9

0

2

su

o m

R

r y p

ti o

u

n

ti m

e

n

1

3

0

1

1

1

3 1

, 2

5

5 5

3

B

2

9

B

K

in 5

0

1

m

1

sec 5

sec

in 3

m

1

B

m

0

sec 4

in 3

0

1

K

6

m

1

sec 8

in 5

0

B

K

1

5

K

0

m

1

sec 7

in 3

7

B

8

8

3 6

K

6

m

B

7

in 1

7

K

2

6

2 8

2

5

m

B

0

in 1

7

K

6

m

B

0

2

7

K

8

4

B

2

7

5

K

7

2

6

2

3

6

, 3

5

1

0

3

3

1

4

6

2

%

(2

4

, 4

5

9

)

5

sec

in 5

m

in 1

sec

6 6

sec

1

sec

3

4

1

, 2

5

6

5

1

1

2

K

B

1

1

m

in 3

8

sec

4

6

1

, 2

5

6

7

3

4

8

K

B

1

1

m

in 5

8

sec

3

, 3

8

9

3

2

K

B

1

1

m

in 5

4

sec

6

The results show that our algorithm has modest memory requirements, and finds many clusters from large logfiles in a relatively short amount of time. We also made some tests with the CLIQUE algorithm, in order to measure the difference of the two algorithms in terms of runtime. Even for medium support threshold values, our algorithm was 5-10 times faster. As the support threshold value was lowered, the difference increased even further.

5

Future work and availability information

As future work, we plan to investigate various association rule algorithms, in order to create an algorithm for detecting patterns that span over multiple logfile lines and fit in a certain time window. SLCT is distributed under the terms of GNU GPL, and is available at http://kodu.neti.ee/~risto/slct/.

References

1.

S

t e p h W 15

2

.

.

.

6

S

.

u

d

V

P

k

C

f e r e n e s h

C .

10

w

D . R

i n g

a t a

E

e

e n

o f

D

. J u

.

A

F

un d

i n s . Aut o

E

c e

R

N

I X

A

a n .

d

e p

r f / , e n

m

th

7

d

l o g

19 e n

9

t

o

e r k

h i n

a s t o g i ,

a n

a t e d

S

S

y s t e m

y s t e m

A

d

m

M

i n

o

n i t o

r i n

g

C

o n

i s t r a t i o n

a n

d

N

f e r e n

o

t i f i c a

c e ,

p

p

t i o

. 14

n 5

-

s ur f e r ( 1

)

a

n

d

l o

g s ur f e r . c o

n

f ( 4

)

m

a

n

ua

l

p a

g

e s .

5 . T

o

o l

S g e

f o r

K

y u

L

o

c a l

e ,

E

S

I G

o f

M

O

0 2

v e n

t

a n

H

h

i m

. R

2

5 ( 5 ) ,

s

R

d

D

D

i g

D

2

S

a g h

h

I n

a t a

i m

0

u

R

g s

o f

M

C

o

r r e l a

t i o

n . A

M

i n

c t a

C

y b

e r n

e t i c a

i m

e n

t e r n a t i o n

O

A R 4

5

p o p

l

D

a n

A

i n g

- 8

T

C

c e

0 0

e c h

n

i q ue s .

D

9

M

Al g

I n

t e r n

o r i t h

m

U

K

S D



C

D

l us t e r i n

g

a t i o n

a l

9 .

P

r a b h

a t a

o n

T

I G

, 19

g

.

AC S

3

l us t e r i n

0

a n d

f o r

f e r e n

C

2

M

l o s ,

a t a

, .

C

p . 7 3 u

b us t 6 6

n

th

5

o

- 3

a k r i s h

n

o n

:

e

g , u

C

K p . 3

t h

n a

a l

C p

a m

G

s i o

a t a

.

i n i n

i t r i o s

D

D

l ,

r o c e e d i n

e h r k e , g

g

t m

y s t e m

a n d

r i e s . P

G

l us t e r i n

s e o k S

i s c o v e r y

e s

M

d

r k

a

l us t e r i n C

C

s u r v e y .h

a t i o n

e h m

D n

a n

G

um

f

0 2

f o r m

n e s

a n

C

e

l l e r m

ur v e y / b

l e d

a l ,

g s

C

r o

M

e c i l i a

a r s h

n a l

o f

P

j e c t e d

C

a n a g e m

H

U

J . B

n

a

g

e n N

r o c o p

V

t

o f

D

e r y

a n

L

a

19

a n

a k r i s h

R 2 0

a m

th

I n

c ,

9 9

P

p

p . 6

d

r g

J o e l

g .

a t a ,

,

i v e r s i t y , d

i u

l us t e r i n

a g e s h

f o r

t h e

S

e i

H t i o p Z

.

R

c e d

O

p

D

L

1- 7

l o k a

. W

o l f ,

r o c e e d

A

e

M

a k a r

i n i n g

a n a g e m

R

Ap

e n

t

a g h

a v a n .

p l i c a

o f

D

t i o n

a t a ,

s .

p p

.

2 , C

t a

S

i n 19

h

P

g s 9

o u

i l i p

S

t h

. Y

e

A

u ,

C

a n

d

S

I G

M

J o n M

g

S

O

o o

D

P

I n

a r k

t e r n

. F

a

s t

a t i o n

a l

9 .

d

e t s . T

h

o f

h

a r y .

e c h

n

M

AF

i c a l

R

I A:

e p

E

o r t

N

f f i c i e n

o . C

P

D

t

a

C

n

- T

d

S

R

- 9

c a l a

9

0

6

- 0

b l e 10

,

. n a n

t e r n

a t i o n

o

a k e s h

A

S

a m

P

p

V

t a b

S

r i k a n t . F

a l

C

o n

- 19 7

,

a

s t

f e r e n

a n

d

Al g

c e

o n

o

r i t h V

m

s

e r y

f o

L

r

M

i n

a r g e

D

i n

g

a t a

As s o B

c i a t i o

a s e s ,

p

n

R

p . 4 8

ul e s .

7

- 4

9

9 ,

t h

M C

i w e

i n o n

e n A

a n

d

D

i m

r o c e e d i n

i t r i o s

g s

o f

G

t h

u

e

n

o p

th

15

u

l o s . C

o

I n

t e r n

a t i o n

n s t r a

i n

a l

t - B

C

a s e d

o n

R

f e r e n

ul e

c e

o n

9 .

t l y a l

Y

o f

a l ,

s e s . P

19 9

f f i c i e n

g s

g r a w a

t e r n a t i o n

C

i n

g

L

f e r e n Y

i n

M

S

. I G

o n

c e

g

M

i n

M

P

o n

a

M

t t e r n a n

i n g

O

D

F I n

s

f r o

a g e m r e q

t e r n

m

D

e n t

ue n

t

a t i o n

a t a

o f P

a l

D a

b

t t e r n

C

a s e s . P

a t a ,

o n

p s

p . 8 w

f e r e n

i t h c e

r o c e e d

5

- 9

3

,

o ut

o n

C

M

i n

19

g s

9

o f

a n

8 . d

i d

a n a g e m

a

t e

e n

t

0 .

t e f f e n e x t

r o c e e d p

0

a

8

e i ,

2 0

a k r i s h P

A

T

D

J r . E I n

r o c e e d i n

e l ,

R s e

p . 18

D

p . 1- 12 ,

s .

e n

J i a n

n . P

o b

J r ., D

g ,

M

ul a t i n g

c t i o n

o

e ,

a y a r d

I G

a n ,

a t a ,

v a n

a r g

J . B

e r a

V

a y a r d L

g i n e e r i n

M

s t i n

. M

I n

s i n g

l us t e r i n

i n

C

Ac c um 15

S

2 .

J o h

P

g r a w

e r t o A

. J i a w G

E

ut e s . I n

o w

t h

o n

C

i n

n

s p a

o i l ,

A

0

J o h

a l ,

f o r

c e

e r t o

i n

o b t h

t k

U

4 .

o b M

A

e

g / l o g s u

S

U

K

e s t e r n

e s h

2 0

t i ,

g g a r w

c e

r o c e e d 9

d

t h

8 .

s

G

e

t f o r m

,

t a

o f

9

w

a j e e v

a

ub

g s

U

At t r i b

o n

. A

s p a

a k

o d

o f

e / e n

i n .

a n

S

m

ub

o r t h

11. R

14

C

o r i t h

3

R

g r a w

19

f e r e n

19

13

,

j a y

P

12

5

. T

g s

j .n e c .c o m

D

c e

d

l a

2

a ,

G

i n

o n

. R

h

a t i c

a r u

N

- 7

r i c a l

A

a n

S S

m

- 10 h

u o

r i c a l

a k

Al g

9

G

o

E

i n

f n .d

e r k h

a t e s h

t e g

4

a n

d i . P

0 5 B

r o c e e d

9 .

. 7

o n R

e y

.c e r t .d

a a r a n p p

a t e g

Aut o

8

L w

i p t o

e n a

C .

V

C

a n d

r o c e e d

: / / c i t e s e e r .n

r

C

7

g w

a v e l

f o

s e n

. P

3 .

: / / w

t t p

a n

t c h

9

( 4 ) ,

P

. H

a

19

i s t o

h 5

E

w

o l f g a n

R

.

,

t t p

15 4

S

5

W h

3

e n

i t h

o

n a i n

l i c a t i o n

H

e i n

c a b

ul a

a n d g s s ,

p

z ,

a n d

H

r i e s . I n J u

s t i n

o f

t h

p

. 2 15

e

Z

th

5 - 2

2

u

g h

f o r m

4 ,

E

a t i o n

o b

e l .

I n

t e r n a t i o n

19

9

7 .

P

. P

W

i l l i a m

r o c e s s i n

e r f o

r m a l

a C

n c e o n

s . g

L

I n - m

e m

e t t e r s , i n

f e r e n

P c e

r a

8 0

o

r y

c t i c e o n

H

( 6 ) ,

D

p o

a t a b

a s h

f

p . 2

T 7

1- 2

S t r i n a s e

S

g

a b

l e s

7 7

,

H

y s t e m

f o

2

0 0

a s h i n s

r 1. g

f o r