Improved Learning of Bayesian Networks
Tomas Kocka
Laboratory for Intelligent Systems
Univ. of Economics Prague
Czech Republic
kocka@vse.cz

Robert Castelo
Institute of Information & Computing Sciences
University of Utrecht
The Netherlands
roberto@cs.uu.nl
Abstract
The search space of Bayesian Network structures is usually defined as Acyclic Directed Graphs (DAGs) and the search is done by local transformations of DAGs. But the space of Bayesian Networks is ordered with respect to inclusion, and it is natural to consider that a good search policy should take this into account. The first attempt to do this (Chickering 1996) used equivalence classes of DAGs instead of DAGs themselves. This approach produces better results, but it is significantly slower. We present a compromise between these two approaches. It uses DAGs to search the space in such a way that the ordering by inclusion is taken into account. This is achieved by the repeated use of local moves within each equivalence class of DAGs. We show that this new approach produces better results than the original DAGs approach without substantial change in time complexity. We present empirical results, within the frameworks of heuristic search and Markov Chain Monte Carlo, obtained on the Alarm dataset.

1 Introduction

A Bayesian Network G for a set of variables V = {x1, ..., xn} represents a joint probability distribution over those variables. It consists of (I) a network structure that encodes assertions of conditional independence in the distribution and (II) a set of local conditional probability distributions corresponding to that structure. The network structure is an acyclic directed graph (DAG) such that each node corresponds to one variable in V.

There are many methods for learning Bayesian Networks from data. They usually consist of three components: (1) a search space and a traversal operator that, by means of local transformations of the structure, defines a set of neighbours, (2) a scoring metric evaluating the quality of a given structure, and (3) a search strategy.

The basic problem is the choice of the search space, because there exist different DAGs which assert the same set of independence assumptions among the variables in the domain; we call such networks equivalent. So the problem is whether to search the space of all DAGs or the space of all equivalence classes of DAGs (represented by essential graphs or in some other way).

The advantage of the space of all equivalence classes is that it is smaller than the space of all DAGs. On the other hand, it is not possible to compare the scores of two essential graphs by local computations derived directly from their graph structure; one must first transform those essential graphs into DAGs in order to do so. Thus it becomes difficult to define the traversal operator, as it needs non-local algorithms converting a DAG to an essential graph and an essential graph to a DAG, which makes it computationally expensive. For DAGs, the simple alternative of adding, removing and reversing an arc is often used as the traversal operator. It was shown in (Chickering 1996) that learning via equivalence classes produces better results than the DAGs approach, but it needs significantly more time.

In the context of the Markov Chain Monte Carlo (MCMC) method, Madigan et al. (1996) show that in order to build an irreducible Markov chain over the space of essential graphs it is necessary to design a traversal operator that modifies two edges at once.

In this paper we argue that the better results presented by (Chickering 1996) are caused not by the usage of equivalence classes alone, but by the combination of the space of equivalence classes and the traversal operator they used.
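For concreteness, the simple add/remove/reverse (AR) operator for DAGs described above can be sketched as follows. This is a minimal illustrative sketch in Python, not the implementation used in the paper; the parent-set representation and the function names are our own assumptions.

    from itertools import permutations

    def has_path(parents, src, dst):
        """Return True iff the DAG contains a directed path src -> ... -> dst.
        'parents' maps each node to the set of its parents."""
        children = {v: set() for v in parents}
        for v, ps in parents.items():
            for p in ps:
                children[p].add(v)
        stack, seen = [src], set()
        while stack:
            u = stack.pop()
            if u == dst:
                return True
            if u not in seen:
                seen.add(u)
                stack.extend(children[u])
        return False

    def ar_neighbours(parents):
        """Yield every DAG obtained from 'parents' by adding, removing or
        reversing a single arc while preserving acyclicity (the AR operator)."""
        for u, v in permutations(parents, 2):
            if u in parents[v]:                  # arc u -> v is present
                removed = {w: set(ps) for w, ps in parents.items()}
                removed[v].discard(u)
                yield removed                    # removal is always safe
                if not has_path(removed, u, v):  # no other u -> v path
                    rev = {w: set(ps) for w, ps in removed.items()}
                    rev[u].add(v)                # reversed arc: v -> u
                    yield rev
            elif v not in parents[u]:            # u and v are not adjacent
                if not has_path(parents, v, u):  # adding u -> v stays acyclic
                    added = {w: set(ps) for w, ps in parents.items()}
                    added[v].add(u)
                    yield added

For a two-node network whose only arc is a -> b, for example, ar_neighbours yields exactly the two legal moves: removing the arc and reversing it.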
Moreover, recent results of (Gillispie 2001) and (Steinsky 2000) suggest that the space of all equivalence classes is only about 3.7 times smaller than the space of all DAGs, so the use of equivalence classes does not provide a substantial reduction of the search space with respect to DAGs. We show that the space of DAGs, which is computationally cheaper to use, can produce similar results much faster.

There are two major contributions in this paper. First, we introduce a new concept of traversal operator for the space of DAGs. This concept is based upon the transformational characterization of equivalent DAGs of (Chickering 1995). It uses transformations among equivalent DAGs instead of equivalence classes, and so all necessary operations are local. It is sensible to expect that learning algorithms that take the inclusion among Bayesian Networks into account will perform better than those that do not. The approach we use is based upon the result of (Kocka et al. 2001) characterizing the inclusion of DAGs that differ in at most one adjacency. This suffices for our purposes as long as we use only local changes to the structures; characterizing inclusion in general is still an open problem.

The second contribution of this paper is the implementation of the previous idea within the frameworks of heuristic learning and the MCMC method. Our experiments show that this approach produces better results than the standard operator for the space of DAGs without substantial change in time complexity. The MCMC implementation is not only an improvement in itself, but it also helps to understand why the approach achieves such an improvement. The experiments have been carried out on the standard benchmark dataset of the Alarm network, used previously by Cooper and Herskovits (1992) and in many other works on the subject.

In the next section we introduce the basic concepts of DAGs, their equivalence classes and their inclusion, with a brief comment on the sizes of the DAG and essential graph spaces. In section 3 we formalize the different concepts of neighbourhood and provide their implementation in the frameworks of heuristic search and MCMC. These neighbourhoods, within the particular implementation we provide, are compared in section 4 using the well known benchmark Alarm dataset. We end with concluding remarks in section 5.
2 Basic concepts
In this section we introduce the notation and the previous work relevant to this paper. Lower case letters denote elements of V, while capital letters denote subsets of V. Possibly indexed capital letters L, G, H will be used to denote DAGs over V. We use E(G) for the underlying (undirected) skeleton of the DAG G.
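The transformational characterization of (Chickering 1995) mentioned in the introduction rests on the notion of a covered arc: an arc u -> v is covered in G iff pa(v) = pa(u) ∪ {u}, that is, u and v have identical parents once u itself is ignored, and reversing a covered arc always yields an equivalent DAG. A minimal sketch under the same assumed parent-set representation as above:

    def is_covered(parents, u, v):
        """An arc u -> v is covered iff pa(v) = pa(u) ∪ {u}."""
        assert u in parents[v], "u -> v must be an arc of the DAG"
        return parents[v] - {u} == parents[u]

    def reverse_covered(parents, u, v):
        """Reverse the covered arc u -> v; by Chickering's (1995) result
        the resulting graph is an equivalent DAG, so no acyclicity check
        is needed."""
        assert is_covered(parents, u, v)
        g = {w: set(ps) for w, ps in parents.items()}
        g[v].discard(u)
        g[u].add(v)
        return g

Chickering (1995) in fact shows that two DAGs are equivalent if and only if one can be transformed into the other by a sequence of such covered arc reversals, which is what makes purely local moves within an equivalence class possible.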
The symbol ⟨A, B | C⟩ denotes a triplet of pairwise disjoint subsets A, B, C of V. The symbol T(V) will denote the class of all such disjoint triplets over V: T(V) = {⟨A, B | C⟩ ; A, B, C ⊆ V pairwise disjoint}.
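To convey the flavour of the repeated covered arc reversals used later in the paper, the following hedged sketch draws a random representative of the current equivalence class by performing a few covered arc reversals; the names and the parameter k are our own illustrative choices, not the authors' code:

    import random

    def covered_arcs(parents):
        """List all covered arcs (u, v) of the DAG, i.e. pa(v) = pa(u) ∪ {u}."""
        return [(u, v) for v in parents for u in parents[v]
                if parents[v] - {u} == parents[u]]

    def rcar_walk(parents, k, rng=random):
        """Perform up to k random covered arc reversals. Every intermediate
        graph is equivalent to the input, so this walks inside a single
        equivalence class and returns another representative of it."""
        g = {w: set(ps) for w, ps in parents.items()}
        for _ in range(k):
            arcs = covered_arcs(g)
            if not arcs:
                break
            u, v = rng.choice(arcs)
            g[v].discard(u)          # covered reversal: u -> v becomes v -> u
            g[u].add(v)
        return g

A heuristic search or MCMC step can then apply an ordinary add/remove move from the representative returned by rcar_walk, realizing the repeated use of local moves within each equivalence class announced in the abstract.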
[Results tables (section 4): for each traversal operator, including RCARR, they report the score (mean ± standard deviation), the structural difference of the learned network, and the seconds per step on the Alarm dataset; the individual numeric entries are not recoverable from this extraction.]