Improved Learning of Bayesian Networks

Tomas Kocka
Laboratory for Intelligent Systems
Univ. of Economics Prague
Czech Republic
kocka@vse.cz

Robert Castelo
Institute of Information & Computing Sciences
University of Utrecht
The Netherlands
roberto@cs.uu.nl

Abstract

The search space of Bayesian Network structures is usually defined as Acyclic Directed Graphs (DAGs) and the search is done by local transformations of DAGs. But the space of Bayesian Networks is ordered with respect to inclusion, and it is natural to consider that a good search policy should take this into account. The first attempt to do this (Chickering 1996) used equivalence classes of DAGs instead of the DAGs themselves. This approach produces better results but it is significantly slower. We present a compromise between these two approaches: it uses DAGs to search the space in such a way that the ordering by inclusion is taken into account. This is achieved by repetitive usage of local moves within each equivalence class of DAGs. We show that this new approach produces better results than the original DAGs approach without substantial change in time complexity. We present empirical results, within the framework of heuristic search and Markov Chain Monte Carlo, obtained on the Alarm dataset.

1 Introduction

A Bayesian Network G for a set of variables V = {x_1, ..., x_n} represents a joint probability distribution over those variables. It consists of (I) a network structure that encodes assertions of conditional independence in the distribution and (II) a set of local conditional probability distributions corresponding to that structure. The network structure is an acyclic directed graph (DAG) such that each node corresponds to one variable in V.
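To make (I) and (II) concrete, here is a toy illustration (our own example, not taken from the paper): the structure encodes the factorization p(x_1, ..., x_n) = prod_i p(x_i | pa(x_i)), and the local distributions supply each factor. A parent-set representation of the DAG is assumed throughout the sketches in this document.

```python
# Toy network a -> c <- b with binary variables, under an assumed
# parent-set representation: component (I) is `parents`, component (II) is `cpts`.
parents = {"a": set(), "b": set(), "c": {"a", "b"}}

P_C1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # p(c=1 | a, b)

def p_a(asg): return 0.3 if asg["a"] else 0.7                 # p(a=1) = 0.3
def p_b(asg): return 0.6 if asg["b"] else 0.4                 # p(b=1) = 0.6
def p_c(asg):
    p1 = P_C1[(asg["a"], asg["b"])]
    return p1 if asg["c"] else 1.0 - p1

cpts = {"a": p_a, "b": p_b, "c": p_c}

def joint(asg):
    """p(x_1, ..., x_n) = product of p(x_i | pa(x_i)) over all nodes."""
    p = 1.0
    for v in parents:
        p *= cpts[v](asg)
    return p

print(joint({"a": 1, "b": 0, "c": 1}))  # 0.3 * 0.4 * 0.4 = 0.048
```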

There are many methods for learning Bayesian Networks from data. They usually consist of three components: (1) a search space and a traversal operator that, by means of local transformations of the structure, defines a set of neighbours, (2) a scoring metric evaluating the quality of a given structure, and (3) a search strategy.

The basic problem is the choice of the search space, because there exist different DAGs which assert the same set of independence assumptions among the variables in the domain; we call such networks equivalent. So the problem is whether to search the space of all DAGs or the space of all equivalence classes of DAGs (represented by essential graphs or otherwise).

The advantage of the space of all equivalence classes is that it is smaller than the space of all DAGs. On the other hand, it is not possible to compare the scores of two essential graphs by local computations derived directly from their graph structure; one should first transform those essential graphs into DAGs in order to do so. Thus it becomes difficult to define the traversal operator, as it needs non-local algorithms converting a DAG into an essential graph and an essential graph into a DAG, which makes it computationally expensive. For DAGs, the simple alternative of adding, removing and reversing an arc is often used as the traversal operator (a sketch of this operator within a greedy search is given below). It was shown in (Chickering 1996) that learning via equivalence classes produces better results than the DAGs approach but needs significantly more time.
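As an illustration of the three components, the following minimal sketch (our code, using the parent-set representation assumed above; `score` stands for any scoring metric such as BDe) implements the standard add/remove/reverse operator together with a greedy search strategy. It is not the paper's implementation.

```python
def has_path(dag, src, dst):
    """True iff the DAG contains a directed path src -> ... -> dst."""
    stack, seen = [dst], set()
    while stack:                          # walk the ancestors of dst
        v = stack.pop()
        if v == src:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(dag[v])
    return False

def neighbours(dag):
    """Standard DAG traversal operator: add, remove or reverse one arc."""
    copy = lambda g: {v: set(ps) for v, ps in g.items()}
    for y in dag:
        for x in dag:
            if x == y:
                continue
            if x in dag[y]:                   # arc x -> y is present
                g = copy(dag)
                g[y].discard(x)
                yield g                       # removal
                if not has_path(g, x, y):     # reversal keeps acyclicity
                    r = copy(g)
                    r[x].add(y)
                    yield r
            elif not has_path(dag, y, x):     # addition keeps acyclicity
                g = copy(dag)
                g[y].add(x)
                yield g

def hill_climb(dag, score):
    """Greedy search strategy: repeatedly move to the best neighbour."""
    current, cur_score = dag, score(dag)
    while True:
        cand = max(neighbours(current), key=score, default=None)
        if cand is None or score(cand) <= cur_score:
            return current
        current, cur_score = cand, score(cand)
```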

In the context of the Markov Chain Monte Carlo (MCMC) method, Madigan et al. (1996) show that, in order to build an irreducible Markov chain over the space of essential graphs, it is necessary to design a traversal operator that modifies two edges at once.
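For contrast, a schematic Metropolis-Hastings sampler over the DAG space (in the spirit of the MC^3 method of Madigan et al.) can reuse the neighbourhood from the sketch above. This illustrates the MCMC framework over DAGs only; it is not the two-edge essential-graph operator just mentioned, and the function names and `log_score` argument are our assumptions.

```python
import math
import random

def mcmc_structures(dag0, log_score, n_steps, seed=0):
    """Schematic Metropolis-Hastings over DAG structures, reusing
    neighbours() from the previous sketch. `log_score` plays the role
    of the log posterior of a structure given the data."""
    rng = random.Random(seed)
    current, log_cur = dag0, log_score(dag0)
    samples = []
    for _ in range(n_steps):
        nbrs = list(neighbours(current))
        proposal = rng.choice(nbrs)
        log_prop = log_score(proposal)
        # Hastings correction: forward and backward neighbourhoods differ in size
        log_q = math.log(len(nbrs)) - math.log(len(list(neighbours(proposal))))
        log_alpha = log_prop - log_cur + log_q
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            current, log_cur = proposal, log_prop
        samples.append(current)
    return samples
```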

In this paper we argue that the better results presented by (Chickering 1996) are not caused by the usage of equivalence classes alone, but by the combination of the space of equivalence classes and the traversal operator they used. Moreover, recent results of (Gillispie 2001) and (Steinsky 2000) suggest that the space of all equivalence classes is only about 3.7 times smaller than the space of all DAGs, so the use of equivalence classes does not provide a substantial reduction in the size of the search space. We show that the space of DAGs, which is computationally cheaper to use, can produce similar results much faster.

There are two major contributions of this paper. First, we introduce a new concept of traversal operator for the DAG space. This concept is based upon the transformational characterization of equivalent DAGs of (Chickering 1995): it uses transformations among equivalent DAGs instead of equivalence classes, so all necessary operations are local (see the covered-arc sketch at the end of this section). It is sensible to expect that learning algorithms that consider the inclusion among Bayesian Networks will perform better than those that do not. The approach we use is based upon the result of (Kocka et al. 2001) characterizing the inclusion of DAGs that differ in at most one adjacency. This suffices as long as we use only local changes to the structures; the general inclusion problem remains open.

The second contribution of this paper is the implementation of the previous idea within the frameworks of heuristic learning and the MCMC method. Our experiments show that this approach produces better results than the standard operator for the DAG space without substantial change in time complexity. The MCMC implementation is not only an improvement by itself; it also helps in understanding why the approach achieves such an improvement. The experiments have been carried out using the standard benchmark dataset of the Alarm network, used previously by Cooper and Herskovits (1992) and in many other works on the subject.

In the next section we introduce the basic concepts of DAGs, their equivalence classes, their inclusion, and a brief comment about the sizes of the DAG and essential graph spaces. In section 3 we formalize the different concepts of neighbourhood and provide their implementation in the frameworks of heuristic search and MCMC. These neighbourhoods, within the particular implementation we provide, are compared in section 4 using the well-known benchmark Alarm dataset. We end with concluding remarks in section 5.
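A minimal sketch of that covered-arc move, under the same parent-set representation (function names are ours, not the paper's): an arc x -> y is covered when pa(y) = pa(x) ∪ {x}, and by Chickering's (1995) characterization reversing a covered arc yields an equivalent DAG, so repeating such reversals moves within one equivalence class.

```python
import random

def is_covered(dag, x, y):
    """Arc x -> y (with x in dag[y]) is covered iff pa(y) = pa(x) ∪ {x}."""
    return dag[y] - {x} == dag[x]

def random_covered_reversal(dag, rng=random.Random(0)):
    """One local move inside an equivalence class: reverse a random covered
    arc. The result is an equivalent DAG (Chickering 1995); repeating this
    move walks within the class."""
    covered = [(x, y) for y in dag for x in dag[y] if is_covered(dag, x, y)]
    if not covered:
        return dag                    # the equivalence class has one member
    x, y = rng.choice(covered)
    g = {v: set(ps) for v, ps in dag.items()}
    g[y].discard(x)                   # drop x -> y
    g[x].add(y)                       # add  y -> x
    return g
```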

2 Basic concepts

In this section, notation and previous work relevant to this paper are introduced. Lower case letters are used to denote elements of V, while capital letters are used to denote subsets of V. Possibly indexed capital letters L, G, H will be used to denote DAGs over V. We use E(G) for the underlying (undirected) skeleton of the DAG G.


The symbol ⟨A, B|C⟩ denotes a triplet of pairwise disjoint subsets A, B, C of V. The symbol T(V) will denote the class of all disjoint triplets over V: {⟨A, B|C⟩ ; A, B, C ⊆ V pairwise disjoint}.

[Experimental results from section 4: for each neighbourhood/operator, the tables reported the score (with standard deviation over runs), the structural difference to the true Alarm network, the number of search steps, and the running time in seconds; the table and figure layout did not survive extraction.]