
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 13, NO. 2, MARCH/APRIL 2001

Symbolic vs. Connectionist Learning: An Experimental Comparison in a Structured Domain

Pasquale Foggia, Roberto Genna, and Mario Vento

Abstract: During the last two decades, the attempts to find effective and efficient solutions to the problem of learning any kind of structured information have split the scientific community. A "holy war" has been fought between the advocates of a symbolic approach to learning and the advocates of a connectionist approach. One of the most repeated claims of the symbolic party has been that symbolic methods are able to cope with structured information while connectionist ones are not. However, in the last few years, the possibility of employing connectionist methods for structured data has been widely investigated and several approaches have been proposed. Does this mean that the connectionist partisans are about to win the ultimate battle? Is connectionism the "One True Approach" to knowledge learning? The paper discusses this topic and gives an experimental answer to these questions. In detail, first, a novel algorithm for learning structured descriptions, ascribable to the category of symbolic techniques, is proposed. It faces the problem directly in the space of graphs by defining the proper inference operators, namely graph generalization and graph specialization, and obtains general and consistent prototypes with a low computational cost with respect to other symbolic learning systems. Subsequently, the proposed algorithm is compared with a recent connectionist method for learning structured data [17] with reference to a problem of handwritten character recognition from a standard database publicly available on the Web. Finally, after a discussion highlighting the pros and cons of the symbolic and connectionist approaches, some conclusions, quantitatively supported by the experimental data, are drawn. The orthogonality of the two approaches strongly suggests their combination in a multiclassifier system so as to retain the strengths of both of them, while overcoming their weaknesses. The results on the experimental case study demonstrated that the adoption of a parallel combination scheme of the two algorithms could improve the recognition performance by about 10 percent. A truce or an alliance between the symbolic and the connectionist worlds?

Index Terms: Symbolic learning, connectionist systems, structural description, attributed relational graph, prototype learning, machine learning.

1 INTRODUCTION

Many scientific areas are populated by applications dealing with structured information, i.e., complex information which can be seen as made of simpler parts suitably interconnected: Parts can be further decomposed into simpler parts until atomic information is obtained. The latter are usually called primitives or components or, less frequently, atoms. As detailed in [1], [16], [17], structured information is widely used in many areas of computer science, such as software engineering, pattern recognition, and problem solving in artificial intelligence, and in other relevant scientific disciplines such as robotics, chemistry, medicine, linguistics, etc. Usually, structured information is represented by means of data structures suitable to express the relations existing among the primitives: lists, trees, and graphs are common examples. Among them, graphs have the highest expressive power, as they allow a description of any kind of binary relation existing between pairs of primitives. Even more general data structures are hypergraphs, which allow us

. The authors are with the Dipartimento di Informatica e Sistemistica, Università di Napoli "Federico II," Via Claudio, 21 I-80125 Napoli, Italy. E-mail: {foggiapa, genna, vento}@unina.it. Manuscript received 23 July 1999; revised 20 Apr. 2000; accepted 16 June 2000. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 112684.

to also describe n-ary relations, i.e., relations involving more than two primitives at a time. Graphs are used in a variety of forms; the most common ones are graphs enriched with a set of attributes associated with nodes and edges, called Attributed Relational Graphs (ARGs). ARGs allow us to represent structural information, e.g., by associating the nodes and the edges, respectively, with the primitives and with the relations among them. Usually, the attributes of the nodes and of the edges are introduced to represent the properties of the primitives and of the relations. Despite their attractiveness in terms of representational power, structural methods (i.e., methods dealing with structured information) imply complex procedures both in the recognition and in the learning processes. In fact, in real applications, the information is affected by distortions and, consequently, the corresponding graphs are very different from the ideal ones. So, in the recognition stage, the comparison between the input sample and a set of prototype graphs (usually a few prototypes per class) cannot be carried out by traditional exact graph matching procedures (isomorphism, subgraph isomorphism, etc.). Moreover, the learning problem, i.e., the task of building a set of prototypes adequately describing the objects (patterns) of each class, is complicated by the fact that the prototypes, implicitly or explicitly, should include a model of the possible distortions. The difficulty of defining effective algorithms for facing this



task is so high that the problem is still considered open and only a few proposals, usable under rather peculiar hypotheses, are now available. These reasons determined, in the scientific community, the birth of two different approaches to the problem, both very recent. The rationale inspiring the first approach relies upon the conviction that structured information can be suitably encoded in order to obtain a representation in terms of a vector, thus making possible the adoption of well-known statistical/neural paradigms. The idea is appealing, as the fields of neurocomputing and statistical theory make available a large variety of well-established and effective algorithms, both for learning and for classifying patterns. The main disadvantage deriving from the use of these techniques is the impossibility of accessing the knowledge acquired by the system. In fact, after learning, the knowledge is implicitly encoded (e.g., within the weights of the connections of the net) and its use, outside the classification stage, is strongly limited. In fact, it is generally very difficult to interpret the obtained prototypical descriptions, to understand how general they are, and to identify the knowledge chunks which allow us to discriminate between the patterns of different classes. Hinton's paper [12] is one of the most relevant among the earlier works ascribable to this approach. It deals with the main representation issues for the processing of structured information by means of neural networks. Namely, it examines different encoding schemes and proposes a method for describing structured information by vectors using a "reduced description." The ideas proposed in this paper are refined in [13], which proposes a concrete representation scheme, namely, the "Holographic Reduced Representations," together with an architecture for encoding and decoding the representation by means of convolution/correlation associative memories. In [16], a generalization of recursive neurons, allowing them to represent structured data given as graphs, is introduced. A network of such neurons, each representing a part of a graph, is used to encode the whole graph as a vector; several learning algorithms, applicable to the proposed architecture, are discussed. A further extension of this approach to other paradigms, both connectionist and statistical, is described in [17]. Other papers about this topic are [14], [15], [18], [19], [20]. The second approach, pioneered by [28], contains methods significantly different from all the previous ones [8], [9], [10], [23], [26]. Instead of converting graphs into vectors and using vector-based learning paradigms, they face the learning problem directly in the representation space of the structured data. This means that, if data are represented by graphs, the learning procedure generates graphs for representing the prototypes of the classes. Due to the complexity of this problem, in this approach, it is possible to enumerate only a few attempts, and some of them are not automatic at all. These can be ascribed to two rather different categories. The first category collects methods based on the assumption that the prototypical descriptions are built by interacting, more or less deeply, with an expert of the domain. In this regard, some authors [2] start from some ideal prototypes and refine manually, in successive


approximations, a model of the possible deformations. Others [3] assume a deformational model and use a small training set for tuning the parameters contained in it. Despite the advantage of facing the problem without a training set, or at least with a small training set, this approach is effective only in simple problems. The inadequacy of human knowledge to find a set of prototypes really representative of a given class significantly increases the risk of errors, especially in domains containing either many data or many different classes. The methods ascribed to the second category face the problem in a formal way [5], [6] by considering it as a symbolic machine learning problem, so formulated: "Given a suitably chosen set of input data, whose class is known, and possibly some background domain knowledge, find out a set of optimal prototypical descriptions for each class." A formal enunciation of the problem and a more detailed discussion of related issues will be given in Section 2. Dietterich and Michalski [8] provide an extensive review of this field, populated by methods which mainly differ in the adopted formalism [6], [7], [9], [22], [24], [25], sometimes more general than that implied by graphs. The advantage making this approach really effective lies in the obtained descriptions, which are explicit. Moreover, the property of being maximally general makes them very compact, i.e., containing only the minimum information for covering all the samples of the same class and for preserving distinguishability among objects of different classes, as required for understanding the features driving the classification task. Due to these properties, the user can easily acquire knowledge about the domain by looking at the prototypes generated by the system, which appear simple, understandable, and effective. Consequently, he can validate or improve the prototypes, or understand what has gone wrong in case of classification errors. Unfortunately, these methods are so heavy, both in terms of computational complexity and memory requirements, that only simple applications can actually be dealt with. The main reason for this can be found in the expressiveness of the adopted formalism. For instance, the formalism used in the above mentioned systems is adequate to express really complex conditions involving also negation and recursion, which rarely occur in learning problems on structured data. The generality of the formalism is paid for with the generation, in the learning stage, of a large number of useless tentative prototypes, so increasing the learning cost. Furthermore, the comparison of a prototype with a sample requires a general unification algorithm, which contributes to making both the learning and the classification phases computationally expensive. At the end of this survey, including systems ascribable to the symbolic and the neural areas, the pros and cons of the two approaches should be clear, but a main practical question still remains open: "Given a problem implying structural learning in a fixed application domain, is it possible to understand which approach is better suited?" Though the investigation of this point could be interesting, the focus of this paper is on a rather different point. We start from the convinced opinion that there is no best way to handle structured information (and information in general)


and both the symbolic and the connectionist approaches have their peculiar virtues and deficiencies. Moreover, as discussed in [33], each approach, taken singly, will prove to be too weak to achieve the versatility that we need in solving complex learning tasks: A purely connectionist system will lack the precision and the rigor needed for complex reasoning schemes like deductive inference, while a purely symbolic system will not be able to deal with uncertainty and approximation, nor to perform the analogical kind of inference that plays an important role in the way human beings exploit their knowledge. Hence, in our view, the symbolic and the connectionist approaches should not be seen as competing for supremacy; instead, only their cooperative integration can provide us with more powerful tools for knowledge exploitation. The first step toward the integration of the two methodologies is their combined use in a multiexpert system, i.e., a system which simultaneously uses different classifiers, whose outputs are fed to a combiner which, on the basis of the answers coming out of each single classifier, takes the final classification decision [35], [36]. As widely demonstrated in the literature, an accurate selection of the combining criteria allows us to significantly improve the results obtained by the single experts, by retaining their strengths while overcoming their weaknesses. The multiexpert approach, which in our case can be applied to combine a symbolic system and a neural one, requires the preliminary verification that the classification results are orthogonal enough, i.e., that there is a significantly high number of samples misclassified by one system but correctly recognized by the other and vice versa. In this paper, we follow this guideline with reference to two algorithms explicitly designed for learning structural information: a recent neural network [17] and a symbolic prototyper proposed here. Details on the two methods will be given in the next sections. The algorithms are used for facing the same problem, and the consequent analysis of the results is aimed at verifying the performance improvements coming from their combined use in a parallel multiexpert system. This analysis encourages the exploitation of more complex schemes for making the structural and the neural approaches cooperate in a more efficient way. The paper is organized as follows: In Section 2, we first introduce the rationale of the proposed symbolic algorithm and, successively, the adopted representational scheme. To this end, novel kinds of graphs, called Generalized Attributed Relational Graphs (GARGs), are defined and the role they assume as prototypes is discussed. In Section 3, the proposed learning algorithm is presented in detail, while Section 4 reports a trace of the learning process with reference to a learning problem involving three classes. The latter has been suitably chosen to discuss and point out the main features and the crucial points of the method. Section 5 reports the results of an experimental comparison between our symbolic learning algorithm and the connectionist approach cited above. The experimental analysis refers to a problem of handwritten character recognition on a standard database available at the site: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/artificial-characters/. Section 5 also reports the


experimental results of both methods. Section 6 first describes, from a theoretical point of view, the pros and cons of the symbolic and the connectionist methods and then analyzes in detail the experimental results obtained by the two approaches. The discussion is aimed at highlighting the orthogonality of the results obtained by the two approaches and how to integrate them in a system which, using both of them simultaneously, improves the performance. Finally, in the last section, some promising perspectives for a tighter integration of the connectionist and symbolic approaches are outlined.

2 THE PROPOSED SYMBOLIC LEARNING METHOD

The rationale of our approach is that of considering descriptions given in terms of Attributed Relational Graphs and devising a method which, inspired by basic machine learning methodologies, particularizes the inference operations to the case of graphs. To this aim, we first introduce a new kind of Attributed Relational Graph, devoted to representing prototypes of a set of ARGs. These graphs are called Generalized Attributed Relational Graphs (GARGs) as they contain generalized nodes, edges, and attributes. Then, we formulate a learning algorithm which builds such prototypes by means of a set of operations directly defined on graphs. The algorithm preserves the generality of the prototypes generated by classical machine learning algorithms. Moreover, similarly to most machine learning systems [5], [6], [7], [8], the prototypes obtained by our system are consistent, i.e., each sample is covered by only one prototype. We assume that the objects are described in terms of Attributed Relational Graphs (ARGs). An ARG can be defined as a 6-tuple $(N, E, A_N, A_E, \alpha_N, \alpha_E)$, where $N$ and $E \subseteq N \times N$ are, respectively, the sets of the nodes and the edges of the ARG, $A_N$ and $A_E$ the sets of node and edge attributes and, finally, $\alpha_N$ and $\alpha_E$ the functions which associate to each node or edge of the graph the corresponding attributes. We will assume that the attributes of a node or an edge are expressed in the form $t(p_1, \ldots, p_{k_t})$, where $t$ is a type chosen over a finite alphabet $T$ of possible types and $(p_1, \ldots, p_{k_t})$ is a tuple of parameters, also from finite sets $P_1^t, \ldots, P_{k_t}^t$. Both the number of parameters ($k_t$, the arity associated with type $t$) and the sets they belong to depend on the type of the attribute; for some types, $k_t$ may be equal to zero, meaning that the corresponding attribute has no parameters. It is worth noting that the introduction of the type permits us to differentiate between the descriptions of the different kinds of nodes (or edges); in this way, each parameter associated with a node (or an edge) assumes a meaning depending on the type of the node itself. For example, we could use the nodes to represent different parts of an object, by associating a node type with each kind of part (see Fig. 1). It is worth noting that the availability of different types allows us to use ARGs for describing even more complex structures. For example, we can also represent a ternary relation by introducing a new node type for representing


Fig. 1. An example of the use of the type information: (a) A set of objects made of three different kinds of parts (circles, triangles, rectangles). (b) The description scheme introduces three types of nodes, each associated to a different part. Each type contains a set of parameters suitable for describing each part. Similarly, edges of the graph, describing topological relations among the parts, are associated to two different types. (c) The graphs corresponding to the objects in (a).

the relation itself. As shown in Fig. 2, using this technique, it is possible to represent Attributed Hypergraphs by means of ARGs. The latter property allows us to apply the learning method to Attributed Hypergraphs, too. Let us introduce a generalization of ARGs, called Generalized Attributed Relational Graphs (from now on, GARGs). A GARG is used for representing a prototype of a set of ARGs; in the following, we will provide a formal definition of GARGs while, in the next section, the use of GARGs as prototypes of ARGs will be described in more detail. In order to allow a GARG (i.e., the prototype it represents) to match a set of possibly different ARGs (the samples covered by the considered prototype), we extend the attribute definition. First of all, the set of types of node and edge attributes is extended with the special type $\phi$, carrying no parameter and allowed to match any attribute type, ignoring the attribute parameters. For the other attribute types, if the sample has a parameter whose value is within the set $P_i^t$, the corresponding parameter of the prototype belongs to the set $\wp(P_i^t)$, where $\wp(X)$ is the power set of $X$, i.e., the set of all the

subsets of $X$. Referring to the previous example of the geometric objects, a node of the prototype could have the attribute $rectangle(\{s, m\}, \{m\})$, meaning a rectangle whose width is small or medium and whose height is medium. This corresponds to the internal disjunction found in languages like Michalski's VL21 [6]. We say that a GARG $G^* = (N^*, E^*, A_N^*, A_E^*, \alpha_N^*, \alpha_E^*)$ covers a sample $G$ and use the notation $G^* \models G$ (the symbol $\models$ denotes the relation called covering) iff there is a mapping $\mu : N^* \to N$ such that:

1. $\mu$ is a monomorphism; that is:

$$n_1^* \neq n_2^* \Rightarrow \mu(n_1^*) \neq \mu(n_2^*) \quad \text{and} \quad \forall (n_1^*, n_2^*) \in E^*,\ (\mu(n_1^*), \mu(n_2^*)) \in E. \tag{1}$$

2. The attributes of the nodes and of the edges of $G^*$ are compatible with the corresponding ones of $G$; that is:

$$\forall n^* \in N^*,\ \alpha_N^*(n^*) \sim \alpha_N(\mu(n^*)) \quad \text{and} \quad \forall (n_1^*, n_2^*) \in E^*,\ \alpha_E^*(n_1^*, n_2^*) \sim \alpha_E(\mu(n_1^*), \mu(n_2^*)), \tag{2}$$


Fig. 2. The use of type information as a means for representing an Attributed Hypergraph in terms of an ARG. (a) An object and (b) its description in terms of an ARG using the alphabets introduced in Fig. 1a. (c) A different possible description by a hypergraph using a ternary relation for representing "collinearity." The hypergraph can be described by introducing a new node type (gray in the figure) representing the ternary relation and the edge types for connecting the nodes involved in the relation. (d) The additions to the alphabets T of Fig. 1c needed to support the new types and (e) the obtained ARG.

where the symbol $\sim$ denotes a relation, called the compatibility relation, defined as follows:

$$\forall t,\ \phi \sim t(p_1, \ldots, p_{k_t}) \quad \text{and} \quad \forall t,\ t(p_1^*, \ldots, p_{k_t}^*) \sim t(p_1, \ldots, p_{k_t}) \Leftrightarrow p_1 \in p_1^* \wedge \ldots \wedge p_{k_t} \in p_{k_t}^*. \tag{3}$$

Condition (1) requires that each primitive and each relation in the prototype be present also in the sample; note that the converse condition does not hold, i.e., the sample can have additional primitives/relations not considered by the prototype. This allows the prototype to specify only the features which are strictly required for discriminating among the various classes, neglecting the irrelevant ones. Condition (2) constrains the monomorphism required by condition (1) to be consistent with the attributes of the prototype and of the sample, by means of the compatibility relation defined in (3): The latter simply states that the type of the attribute of the prototype must be either equal to $\phi$ or to the type of the corresponding attribute of the sample; in the latter case, all the parameters of the attribute, which are actually sets of values, must contain the value of the corresponding parameter of the sample. Another important relation that will be introduced is specialization (denoted by the symbol $\prec$): A prototype $G_1^*$ is said to be a specialization of $G_2^*$ iff:

$$\forall G,\ G_1^* \models G \Rightarrow G_2^* \models G. \tag{4}$$

In other words, a prototype $G_1^*$ is a specialization of $G_2^*$ if every sample covered by $G_1^*$ (not only the samples in the training set) is also covered by $G_2^*$. Hence, a more specialized prototype imposes stricter requirements on the samples to be covered. In Fig. 3, a prototype covering some objects and a specialization of it are reported. From its definition, it follows that the specialization relation introduces a nontotal ordering in the prototype space; it can be proved that the empty prototype (i.e., the prototype with no nodes and, consequently, no edges) is the minimum of this relation, and so any prototype can be seen as a specialization of it. If we neglect the empty prototype, the minimum of the specialization relation is the prototype having only one node with attribute $\phi$; any nonempty prototype is a specialization of this trivial prototype, which covers all the nonempty samples. As we will see in the next section, our algorithm exploits this property by starting from this trivial prototype and specializing it with successive modifications.
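To make the preceding definitions concrete, here is a minimal Python sketch of ARGs, GARGs, the compatibility relation (3), and the covering test (1)-(2) via a brute-force monomorphism search. All names (Attr, Graph, covers, PHI) are illustrative choices of ours, not the authors' implementation, and the exhaustive search merely stands in for the graph matching algorithms cited in Section 3.

```python
from itertools import permutations

PHI = None  # the special type "phi": matches any attribute, parameters ignored

class Attr:
    """An attribute t(p1, ..., pk). In a sample, params are values;
    in a prototype, params are *sets* of admissible values."""
    def __init__(self, type_, *params):
        self.type, self.params = type_, params

def compatible(proto_attr, sample_attr):
    """Compatibility relation (3): phi matches anything; otherwise the
    types must coincide and each sample value must lie in the
    corresponding parameter set of the prototype."""
    if proto_attr.type is PHI:
        return True
    return (proto_attr.type == sample_attr.type and
            all(v in s for s, v in zip(proto_attr.params, sample_attr.params)))

class Graph:
    """Shared structure of ARGs (samples) and GARGs (prototypes):
    nodes maps node-id -> Attr, edges maps (id1, id2) -> Attr."""
    def __init__(self, nodes, edges):
        self.nodes, self.edges = nodes, edges

def covers(proto, sample):
    """Covering test: search for a monomorphism mu from proto to sample
    satisfying (1) and (2). Brute force, acceptable for small graphs."""
    p_nodes, s_nodes = list(proto.nodes), list(sample.nodes)
    for image in permutations(s_nodes, len(p_nodes)):  # injective mappings
        mu = dict(zip(p_nodes, image))
        if not all(compatible(proto.nodes[n], sample.nodes[mu[n]])
                   for n in p_nodes):
            continue
        if all((mu[a], mu[b]) in sample.edges and
               compatible(attr, sample.edges[(mu[a], mu[b])])
               for (a, b), attr in proto.edges.items()):
            return True
    return False
```

For instance, a prototype node carrying Attr('rectangle', {'s', 'm'}, {'m'}) covers a sample node carrying Attr('rectangle', 's', 'm'), mirroring the internal disjunction example above.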

3 LEARNING GARGs: THE PROPOSED ALGORITHM

The goal of the learning algorithm can be stated as follows: There is a (possibly infinite) set $S^*$ of all the patterns that may occur, partitioned into $C$ different classes $S_1^*, \ldots, S_C^*$, with


Fig. 3. (a) A GARG representing the set of the four different ARGs associated with the objects presented in (b), whose ARGs are given in Fig. 1c. Note that, for the sake of clarity, we have used the disjunction (or) instead of the usual set-theoretic notation. Informally, the GARG represents "any object made of a part on top of a rectangle of any width and height." (c) A specialization of the GARG given in (a), obtained by adding a node and an edge, and (d) the objects covered by it. Informally, the latter GARG represents "any object made of a part on top of two other parts, which are a rectangle with a large height and any width, and another unspecified part."

$S_i^* \cap S_j^* = \emptyset$ for $i \neq j$; the algorithm is given a finite subset $S \subseteq S^*$ (training set) of labeled patterns ($S = S_1 \cup \ldots \cup S_C$ with $S_i = S \cap S_i^*$), from which it tries to find a sequence of prototype graphs $G_1^*, G_2^*, \ldots, G_p^*$, each labeled with a class identifier, such that

$$\forall G \in S^*\ \exists i : G_i^* \models G \quad \text{(completeness of the prototype set)} \tag{5}$$

and

$$\forall G \in S^*,\ G_i^* \models G \Rightarrow \mathrm{class}(G) = \mathrm{class}(G_i^*) \quad \text{(consistency of the prototype set)}, \tag{6}$$

where $\mathrm{class}(G)$ and $\mathrm{class}(G_i^*)$ refer to the class associated with the sample $G$ and with the prototype $G_i^*$, respectively. Of course, this is an ideal goal since only a finite subset of $S^*$ is available to the algorithm; in practice, the algorithm can only demonstrate that completeness and consistency hold for the samples in $S$. On the other hand, (5) dictates that, in order to get as close as possible to the ideal case, the prototypes generated should be able to model samples not found in $S$, that is, they must be more general than the enumeration of the samples in the training set. However, they should not be too general, otherwise (6) will not be satisfied. The achievement of the optimal trade-off between completeness and consistency makes prototype learning a really hard problem. In this regard, our definition of the covering relation, which allows the sample to have nodes and edges not present in the prototypes, is aimed at increasing the generality of the prototypes themselves; in fact, each prototype can specify only the distinctive features of a class, i.e., the ones which allow the class samples to be distinguished from those of other classes; optional features are left out of the prototype and their presence or absence has no effect on the classification. It is worth pointing out that the representation of the prototypes in terms of GARGs does not make provisions for expressing negation, i.e., for specifying that some features must not be present for the sample to be covered by the prototype. While this lack limits to some extent the expressiveness of the prototypes, it entails a significant

performance benefit for the learning algorithm. In fact, without negation, the test of the covering relation involves a graph monomorphism between the structure of the prototype and that of the sample, since each primitive and each relation of the prototype must have a correspondent in the sample. The search for a monomorphism can be performed using a graph matching algorithm [29], [30], [4], [32], which can be quite efficient, at least in the average case; the test of the compatibility of the attributes can be easily incorporated into most graph matching algorithms. On the other hand, the addition of negation would imply the need for a more complex definition of the covering relation, requiring a more expensive algorithm for its verification. In fact, consider that, in the presence of negation, the addition of a feature to a prototype will not necessarily result in a specialization of that prototype. For instance, Fig. 4 presents a case with two prototypes where the second is derived from the first by adding a negated node (i.e., a node expressing a feature that must not be present). Although the latter contains one more node, it covers more samples and so it is more general than the previous prototype. Another problem related to negation is that, if the prototype can specify a list of undesired features, the termination of the learning algorithm in a finite time cannot be ensured (unless some artificial limitation is imposed on the generated prototypes); in fact, it may happen, for a particular training set, that this list can grow indefinitely without reducing the number of covered training samples. However, there are cases in which negation is essential for defining discriminant prototypes: A notable example of such cases are domains in which the patterns of some class can be viewed as subpatterns of another class. In order to deal with these situations, our algorithm uses a modified version of (6), which introduces a precedence relation among the prototypes, in a way which is similar to the approach of k-Decision Lists [27]. In particular, our method considers the order in which the prototypes are generated, assuming that $G_i^*$ has precedence over $G_j^*$ iff $i < j$: We say that the prototype sequence is consistent iff:


Fig. 5. A sketch of the learning procedure.

Fig. 4. A potential problem arising when the prototype description includes negation. For the sake of simplicity, the example does not consider relations between the parts. (a) A prototype representing "any object containing a rectangle and not a circle" (negation is graphically expressed by means of a slash on the corresponding node). (b) The objects covered by the prototype (the gray one is the only one not covered). (c) A prototype derived from (a) by adding a negated node; it represents "any object containing a rectangle and not two circles." (d) The objects covered by the prototype (c). Note that (c) is obtained from (a) by adding a negated node, but covers more samples than (a); consequently, (c) cannot be considered as a specialization of (a).

$$\forall G \in S^*,\quad i = \min\{\, j \mid G_j^* \models G \,\} \Rightarrow \mathrm{class}(G) = \mathrm{class}(G_i^*). \tag{7}$$

In other words, a sample is compared sequentially against the prototypes, in the same order in which they have been generated, and it is attributed to the class of the first prototype that covers it. In this way, each prototype implicitly entails the condition that the sample is not covered by any previous prototype. Thus, with a careful choice of the order in which the prototypes are generated, the problems arising when the samples of a class are subpatterns of another class are avoided. One of the strengths of the proposed learning method is the automatic handling of these situations, by adopting a learning strategy which simultaneously considers all the classes that must be learned in order to determine (without hints from the user) the proper ordering for the prototypes. A sketch of the learning algorithm is presented in Fig. 5; the algorithm starts with an empty list L of prototypes and tries to cover the training set by successively adding consistent prototypes. When a new prototype is found, the samples covered by it are eliminated and the process continues on the remaining samples of the training set. The algorithm fails if no consistent prototype covering the remaining samples can be found. It is worth noting that the test of consistency in the algorithm actually checks whether the prototype is almost consistent, i.e., whether almost all the samples covered by $G^*$ belong to the same class:

$$\mathrm{Consistent}(G^*) \Leftrightarrow \max_i \frac{|S_i(G^*)|}{|S(G^*)|} \geq \theta, \tag{8}$$

where $S(G^*)$ denotes the set of all the samples of the training set covered by a prototype $G^*$, and $S_i(G^*)$ the samples of the class $i$ covered by $G^*$, i.e., $S(G^*) = \{G \in S \mid G^* \models G\}$ and $S_i(G^*) = \{G \in S_i \mid G^* \models G\}$. In (8), $\theta$ is a threshold close to 1 that is used to adapt the tolerance of the algorithm to slight inconsistencies, in order to have a reasonable behavior also on noisy training data. For example, with $\theta = 0.95$, the algorithm would consider a prototype consistent if more than 95 percent of the covered training samples belong to the same class, avoiding a further specialization of this prototype that could be detrimental to its generality. Note that the assignment of a prototype to a class is done after the prototype has been found, meaning that the prototype is not constructed in relation to an a priori determined class: The algorithm finds, at each step, the class which can be better covered by a prototype and generates a prototype for it. In this way, if the patterns of a class $j$ can be viewed as subpatterns of samples of another class $i$ (e.g., the graphs describing the character "F" are often subgraphs of those representing the character "E"), the algorithm will cover first the class $i$ and then the class $j$; in this case, we say that the prototypes of the class $i$ have precedence over those of class $j$, according to (7). The most important part of the algorithm is the FindPrototype procedure, illustrated in Fig. 6. It performs the construction of a prototype, starting from a trivial prototype with one node whose attribute is $\phi$ (i.e., a prototype which covers any nonempty graph), and refining it by successive specializations until either it becomes consistent or it covers no samples at all. The FindPrototype algorithm is greedy, in the sense that, at each step, it chooses the specialization that seems to be the best one, looking only at the current state without any form of look-ahead. An important step of the FindPrototype procedure is the construction of a set Q of specializations of the tentative prototype $G^*$, described in detail in Section 3.1. The adopted definition of the heuristic function H, guiding the search for the current optimal prototype, will be examined later.
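The loop of Fig. 5 and the sequential decision rule (7) might be rendered as follows; this is a hypothetical reconstruction from the description above, not the authors' code (covers comes from the earlier sketch, and find_prototype, the greedy procedure of Fig. 6, is sketched after Section 3.2).

```python
def consistent(proto, graphs, labels, theta=0.95):
    """Almost-consistency test (8): the dominant class among the covered
    training samples must account for at least a fraction theta of them."""
    covered = [l for g, l in zip(graphs, labels) if covers(proto, g)]
    if not covered:
        return False
    return max(covered.count(c) for c in set(covered)) / len(covered) >= theta

def learn(graphs, labels):
    """Fig. 5: repeatedly find an (almost) consistent prototype, assign it
    the majority class it covers, and drop the covered samples."""
    prototypes, remaining = [], list(zip(graphs, labels))
    while remaining:
        gs, ls = [g for g, _ in remaining], [l for _, l in remaining]
        proto = find_prototype(gs, ls)          # greedy search of Fig. 6
        if proto is None:
            raise RuntimeError("no consistent covering prototype found")
        covered = [l for g, l in remaining if covers(proto, g)]
        prototypes.append((proto, max(set(covered), key=covered.count)))
        remaining = [(g, l) for g, l in remaining if not covers(proto, g)]
    return prototypes

def classify(sample, prototypes):
    """Decision rule (7): the first covering prototype, in generation
    order, determines the class; None acts as a reject answer."""
    for proto, cls in prototypes:
        if covers(proto, sample):
            return cls
    return None
```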

3.1 The Specialization Operators

At each step, the algorithm tries to refine the current prototype definition in order to make it more consistent by


Fig. 6. The function FindPrototype.

Fig. 7. The function Specialize.

replacing the tentative prototype with one of its specializations. To accomplish this task, we have defined a set of specialization operators which, given a prototype graph $G^*$, produce a new prototype $G^{*\prime}$ such that $G^{*\prime} \prec G^*$. The considered specialization operators are (a Python sketch follows the list):

1. Node addition: $G^*$ is augmented with a new node $n^*$ whose attribute is $\phi$. This operator is always applicable.
2. Edge addition: A new edge $(n_1^*, n_2^*)$ is added to the edges of $G^*$, where $n_1^*$ and $n_2^*$ are nodes of $G^*$ and $G^*$ does not already contain an edge between them. The edge attribute is $\phi$. This operator is applicable if $G^*$ is not a complete graph.
3. Attribute specialization: The attribute of a node or an edge is specialized according to the following rule:
   - If the attribute is $\phi$, then a type $t$ is chosen and the attribute is replaced with $t(P_1^t, \ldots, P_{k_t}^t)$. This means that only the type is fixed, while the type parameters can match any value of the corresponding type.
   - Else, the attribute takes the form $t(p_1^*, \ldots, p_{k_t}^*)$, where each $p_i^*$ is a (not necessarily proper) subset of $P_i^t$. One of the $p_i^*$ such that $|p_i^*| > 1$ is replaced with $p_i^* - \{\bar{p}_i\}$, with $\bar{p}_i \in p_i^*$. In other words, one of the possible values of a parameter is excluded from the prototype.

Note that, except for node addition, the specialization operators can usually be applied in several ways to a prototype graph; for example, the edge addition can be applied to different pairs of nodes. In these cases, it is intended that the function Specialize exploits all the possibilities (Fig. 7).
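A sketch of the three operators in the same illustrative Python setting (Graph, Attr, and PHI come from the earlier sketch; ALPHABET, mapping each type $t$ to its parameter value sets $P_i^t$, is a hypothetical stand-in for the attribute alphabet of the application). Following Fig. 7, it enumerates every applicable instance of every operator.

```python
from copy import deepcopy

# Hypothetical attribute alphabet: type t -> tuple of parameter value sets P_i^t
ALPHABET = {'rectangle': ({'s', 'm', 'l'}, {'s', 'm', 'l'})}

def specializations(proto):
    """Enumerate every prototype obtainable from proto by one operator."""
    out = []
    g = deepcopy(proto)                       # 1. node addition (always applicable)
    g.nodes[max(g.nodes, default=-1) + 1] = Attr(PHI)   # node ids assumed integers
    out.append(g)
    for a in proto.nodes:                     # 2. edge addition, one per missing pair
        for b in proto.nodes:
            if a != b and (a, b) not in proto.edges:
                g = deepcopy(proto)
                g.edges[(a, b)] = Attr(PHI)
                out.append(g)
    items = [('n', k, v) for k, v in proto.nodes.items()] + \
            [('e', k, v) for k, v in proto.edges.items()]
    for kind, key, attr in items:             # 3. attribute specialization
        if attr.type is PHI:                  # fix a type, keep full parameter sets
            for t, psets in ALPHABET.items():
                g = deepcopy(proto)
                target = g.nodes if kind == 'n' else g.edges
                target[key] = Attr(t, *(set(p) for p in psets))
                out.append(g)
        else:                                 # exclude one value from one parameter
            for i, pset in enumerate(attr.params):
                for v in (pset if len(pset) > 1 else []):
                    g = deepcopy(proto)
                    target = g.nodes if kind == 'n' else g.edges
                    new_params = list(deepcopy(attr.params))
                    new_params[i] = pset - {v}
                    target[key] = Attr(attr.type, *new_params)
                    out.append(g)
    return out
```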

3.2 The Heuristic Function

The heuristic function H is introduced for evaluating how promising the provisional prototype is. It is based on the estimation of the consistency and completeness of the

prototype (see (5) and (6)). To evaluate the consistency degree of a provisional prototype $G^*$, we have used an entropy-based measure:

$$H_{cons}(S, G^*) = I(S) - I(S(G^*)), \tag{9}$$

where

$$I(S) = -\sum_i \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \quad \text{and} \quad I(S(G^*)) = -\sum_i \frac{|S_i(G^*)|}{|S(G^*)|} \log_2 \frac{|S_i(G^*)|}{|S(G^*)|}.$$

$I(S(G^*))$ represents the quantity of information (in bits) necessary to express the class a given element of $S(G^*)$ belongs to; its value is zero in the case that all the samples belong to the same class, and is $\log_2 C$ if all the classes have the same number of occurrences (in this case, $G^*$ does not provide any information about the class of the sample). It follows that the larger the value of $H_{cons}(S, G^*)$, the more consistent $G^*$ is; hence, the use of $H_{cons}$ will drive the algorithm toward consistent prototypes. The completeness of a provisional prototype is taken into account by a second term of the heuristic function, which simply counts the number of samples covered by the prototype:

$$H_{compl}(S, G^*) = |S(G^*)|. \tag{10}$$

This term is introduced in order to privilege general prototypes with respect to prototypes which, albeit consistent, cover only a small number of samples. The heuristic function used in our prototyping algorithm is given by the product of $H_{compl}$ and $H_{cons}$:

$$H(S, G^*) = H_{compl}(S, G^*) \cdot H_{cons}(S, G^*) = |S(G^*)| \cdot (I(S) - I(S(G^*))). \tag{11}$$


Fig. 8. (a) The training data of the example problem. (b) The alphabets used for representing the images in terms of ARGs: The nodes are associated with geometric figures and have a single node type with two parameters: shape, with three values (triangle, rectangle, and circle), and size, with three values (small, medium, and large). The edges of the graph represent the "inside" relation and carry no parameter. (c) The ARGs corresponding to the considered images.

Of course, this particular heuristic function is not the only possible choice that satisfies the requirement of favoring the more consistent and complete prototypes, nor is it guaranteed to be the best one for each possible kind of application domain. On the other hand, our algorithm is able to work with any other heuristic function as well; we have adopted this particular one here because it is simple, is based on well-understood and widely used functions [22], and has proven to be effective in many test cases.
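In the same illustrative Python setting, the heuristic (9)-(11) and the greedy FindPrototype of Fig. 6 might look as follows (consistent and specializations come from the earlier sketches; as everywhere in these sketches, the code is a reconstruction from the text, not the authors' implementation). As a hypothetical numeric check of (11): with two equally frequent classes in $S$, $I(S) = 1$ bit; if a tentative prototype covers 10 samples, 8 of one class and 2 of the other, then $I(S(G^*)) \approx 0.722$ and $H = 10 \cdot (1 - 0.722) \approx 2.78$.

```python
from math import log2

def entropy(labels):
    """I(S): bits needed to express the class of an element of S."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def heuristic(graphs, labels, proto):
    """H(S, G*) = |S(G*)| * (I(S) - I(S(G*))), per (9)-(11)."""
    covered = [l for g, l in zip(graphs, labels) if covers(proto, g)]
    if not covered:
        return 0.0
    return len(covered) * (entropy(labels) - entropy(covered))

def find_prototype(graphs, labels, theta=0.95):
    """Fig. 6: start from the trivial prototype (one phi node) and greedily
    apply the best-scoring specialization until (8) holds or coverage dies."""
    proto = Graph({0: Attr(PHI)}, {})
    while not consistent(proto, graphs, labels, theta):
        scored = [(heuristic(graphs, labels, p), p)
                  for p in specializations(proto)]
        best_h, best = max(scored, key=lambda hp: hp[0])
        if not any(covers(best, g) for g in graphs):
            return None                      # dead end: nothing covered anymore
        proto = best
    return proto
```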

4 A LEARNING EXAMPLE

In this section, with reference to a simple learning problem involving structural descriptions, we present, in detail, all the steps of the proposed algorithm. The problem has been derived from the well-known Bongard's tests [31], slightly complicated to better highlight some crucial points of the

learning process. While Bongard's tests imply learning problems with two classes, we have considered three classes, obtained by fusing two different Bongard's tests, namely, tests 71 and 47. We recall that these tests contain simple geometric shapes arranged according to unknown rules. The aim of the test is to discover a property verified by all six images of the considered class and by none of the other classes (see Fig. 8a). Each image is described in terms of an ARG in which the nodes are associated with the geometric objects in the image and the edges represent the "inside" relation. The node attributes have a single type with two parameters: the shape, with three possible values (triangle, rectangle, and circle), and the size, with three values (small, medium, and large). The edge attributes have a single type with no parameter, as the inside relation is not further detailed. Fig. 8b presents the adopted description scheme by


Fig. 9. The GARG prototypes found by the algorithm and, for each of them, an informal description. The symbol "?" has been used for denoting, in a compact way, any possible value of the corresponding parameter.

formally introducing the list of types and the corresponding attributes. Finally, Fig. 8c reports the description of all the images involved in the considered test. Fig. 9 reports the prototypes found by the algorithm. The first prototype, $G_1^*$, informally described as "any image containing an object inside an object which is in turn inside another object, plus a fourth object," entirely covers class 1. Note that this prototype is slightly redundant; in fact, the fourth object is not strictly necessary, since its elimination still allows us to cover all the samples of class 1 without covering any of the images of the other two classes. The spurious generation of an additional node has to be attributed to the greedy behavior of the algorithm. Later in this section, the origin of this redundancy will be discussed further. In any case, a suitable postprocessing of the generated prototypes would allow us to obtain more compact prototypes. The second prototype, $G_2^*$, informally "any image containing a triangle or a rectangle inside another object," covers all the samples belonging to class 2. It is worth noting that, according to the rationale of the approach, the definition of the prototype $G_2^*$ applies only after the verification that the considered sample does not match the previously determined prototypes, in this case $G_1^*$. Under this assumption, it would be more precise to define $G_2^*$ as "any image which does not match $G_1^*$ and contains a triangle or a rectangle inside another object." For the sake of simplicity, in the following sections of the paper, we will adopt the former way of describing prototypes, it being implicitly understood that any prototype definition applies only after the verification of the previously found ones. Finally, since the first two prototypes cover all the samples of classes 1 and 2, the algorithm uses the trivial prototype $G_3^*$, informally "any object," to cover the samples of class 3. Fig. 10 shows the results of the intermediate steps of the learning algorithm; in particular, the sequence of provisional prototypes found by the function FindPrototype is presented, together with the covered training samples. This figure makes it clear why the algorithm generates a slightly redundant prototype for class 1: The addition of the fourth node during the construction of $G_1^*$ arises at Step 2.4 because it determines the elimination of a sample from class 3, so increasing the consistency of the prototype of class 1. The successive addition of two edges excludes all

the samples from classes 2 and 3, thus making the fourth node of the prototype, added at the previous step, useless; but the algorithm does not backtrack and, thus, the node is not removed. To provide a better understanding of the role played by the heuristic function in the construction of the prototype, we have reported in Fig. 11 all the alternative choices considered by the algorithm while building the prototype $G_2^*$. Together with each alternative, the set of the covered samples and the corresponding value of the heuristic function are shown. It is worth noting that the addition of the edge is preferred to the addition of the third node because, even though both specializations are equally consistent, the former is more complete.

5 SYMBOLIC VS. CONNECTIONIST: AN EXPERIMENTAL CASE STUDY

In this section, we report the results of an experimental comparison between our symbolic learning algorithm and the connectionist approach to the structural learning problem [17]. The latter has been chosen as it adopts a more general representation scheme in terms of graphs than all the other connectionist methods dealing with structured data. In particular, it works on directed, acyclic, and ordered graphs (from now on, DOAGs), which are a very general kind of graph, yet less general than the ARGs used by our learning algorithm. From now on, for the sake of simplicity, we will call our algorithm symbolic and the algorithm of [17] neural. The two methods have been tested on a character recognition problem using a standard database. It consists of 6,000 characters of ten capital letters (A, C, D, E, F, G, H, L, P, R), which have been artificially distorted according to a random model described in [21]. The whole database is partitioned into a training set of 100 samples per class, partially represented in Fig. 12, and a test set of 500 samples per class. Each character of the database is represented by means of a set of line segments characterized by their endpoints. No processing has been carried out on the characters before their description in terms of graphs. The nature of the database would make it advisable to improve the character representation, for example, by approximating sequences of straight segments with second order components, such as circular arcs. Chianese et al.'s paper [11]


Fig. 10. A trace of the algorithm: To the left are the processing actions, in the middle are the current prototypes, and to the right, the set S of samples covered by it. Bold rectangles denote prototypes which become definitive, while bold nodes and edges represent the additions to the previous provisional prototypes.

reports a method for obtaining a description scheme of this type.

Fig. 13 illustrates the adopted description scheme in terms of ARGs; basically, we have two node types


Fig. 11. The alternative specializations explored by the function Specialize during the construction of the prototype $G_2^*$ (see Steps 2.7 to 2.10 of Fig. 10). Numbers on the arrows represent the values of the heuristic function for each specialization. Thick arrows denote the specializations actually selected.

respectively used for representing the segments and their junctions, while the edges represent the adjacency of a segment to a junction. Node attributes are used to

encode the size of the segments (normalized with respect to the size of the whole character) and their orientation; the edge attributes represent the relative


Fig. 12. Some of the samples of the training set. Note the variability within each class in terms of the number of segments, their size and orientation, and of the interconnections among them.

position of a junction with respect to the segments it connects. Since the neural method adopted does not make provisions for edge attributes, this latter information is associated to junction nodes when the graphs are encoded in the format required by the neural software, as detailed below. Each node has three attributes. The first one discriminates between types line and join of Fig. 13 using the values 0 and 1. For line nodes, the other two attributes encode the quantized size and angle with integer values ranging from 0 to 4. For join nodes, they are used to encode the information related to the x- and y-projection of the

junction point (ranging from 0 to 2) which, in the symbolic system, is represented by means of edge attributes.
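A small sketch of this encoding; the function name and the node dictionary layout are our own illustrative assumptions, and the quantization into bins is presumed to be done beforehand, as the text only specifies the resulting integer ranges.

```python
def encode_node(node):
    """Encode a node of the character graph as the 3 integer attributes
    required by the neural software: [type flag, attr1, attr2]."""
    if node["kind"] == "line":
        # flag 0; quantized size and angle, each an integer in 0..4
        return [0, node["size_bin"], node["angle_bin"]]
    # flag 1; x- and y-projection of the junction point, each in 0..2
    # (the information carried by edge attributes in the symbolic system)
    return [1, node["x_proj"], node["y_proj"]]
```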

5.1 The Results of the Symbolic Algorithm

The graphs representing the training samples have been submitted to our learning algorithm, which generated a list of 74 prototypes, consistent and complete with respect to the training set. Fig. 14 shows the number of prototypes per class and, for each of them, the percentage of samples of the class it covers. As is clear from Fig. 14, most of the classes are covered at 80 percent using only a small number of prototypes (2 or 3) per class. These prototypes have a high


Fig. 13. (a) An example of a character of the database. Its representation is in terms of straight segments (numbered from 0 to 4) and connections among them (numbered from 5 to 11). (b) The formal description of node and edge types. Nodes of the ARG are used for describing both the lines (type line) and the connections among them (type join). Nodes associated with lines have two parameters, namely, the size and the angle of the segment with respect to the x axis. Joins have no parameters. The edges of the graph are used for describing the properties of the connections of segments: They denote the relative position of the projections of the connection and the segment on the x (y) axis. (c) The corresponding graph. Nodes of type join are drawn in gray.

Fig. 14. The coverage of the 74 prototypes found, expressed as a percentage of the number of samples of the corresponding class; within each class, prototypes are denoted with different colors and ordered as a function of their coverage. D, E, and G are well-represented by a small number of prototypes. The class with the highest number of prototypes is H.

coverage and capture the major invariants of the character shapes inside a class. The remaining prototypes account for a few characters which, because of noise, are quite dissimilar from the average samples of their class.


Fig. 15. A visual analysis of the samples of the training set covered by the found prototypes. For each class, only the most representative prototype has been selected, i.e., the one having the highest coverage, reported in the second column of the table.

A visual impression of the effectiveness of the prototypes can be obtained by looking at Fig. 15. It presents the training set samples covered by the most representative prototype of each class (i.e., the one having the maximum coverage). It can be noted that the samples covered by each prototype, even though belonging to the same class, are quite different from each other; this confirms the capability of the prototypes to catch the invariance of the structural data even in domains with large variability. Fig. 16 reports the recognition rate obtained on each class, while the whole classification matrix is summarized in the white columns of Table 1 (the dark ones refer to the experimentation of the neural algorithm). The average recognition rate is 90.3 percent, which is a good result if we consider that the samples are highly noisy and, as anticipated, no effort has been made in the description

phase to reduce the variability of the graphs representing the characters. It is also worth noting that misclassification is due mainly to confusion between classes whose samples share a common subgraph. For example, the highest error occurs in the case of As, 18 percent of whose samples are classified as Hs. The errors refer to all the As made of two perfectly vertical straight lines, connected in the middle and at the top by two horizontal straight lines. These As can be obtained by adding a horizontal straight line on top of most of the Hs. Similar considerations apply to the other categories of errors. To give an idea of the time performance of the algorithm: for this experimentation, it has been implemented in C++ and executed on a PC with a 500 MHz Pentium III processor and 128 MB of RAM under Linux. The time required by the learning process on the considered example was about five hours.


Fig. 16. Recognition rate of the compared systems on the test set. Note that the recognition rate of the neural algorithm exhibits greater variations across the different classes (e.g., class H gets a rate as low as 16 percent, while the rate of our system is never lower than 60 percent); nevertheless, five (out of ten) classes are perfectly recognized by the neural network, versus only one for our symbolic algorithm.

5.2 The Results of the Connectionist Algorithm

For the experimentation of the considered neural algorithm, we have used the public code developed by Maggini, which can be retrieved at http://www.dii.unisi.it/~apods. Since the graphs used for representing the adopted data set are directed and acyclic, as required by the hypotheses of this algorithm, no preliminary graph transformation has been made on the training set used for the symbolic experimentation, except for the already explained attribute encoding. The other restriction the neural algorithm imposes on the input graphs is that there be a well-defined total ordering among the branches associated with each node. Actually, our data set is not guaranteed to comply with this requirement, since there is no natural ordering among the junctions adjacent to a given segment. On the other hand, because of the way the graphs are constructed, only a few different branch permutations are likely to occur. These permutations are seen as different graphs by the neural network, so increasing the number of variants of each class to be taken into account. However, all these

variants are well-represented in the training set, allowing the network to overcome this problem, as is indirectly confirmed by the fact that there is no significant performance loss on the test set with respect to the training set. The neural network uses a three-layer architecture with three inputs, four channels, and 10 outputs. We have tested different numbers of state and hidden neurons; the best result has been obtained with 20 state neurons and 40 hidden neurons. The adopted learning algorithm is the standard Back-Propagation Through Structure (BPTS) with a learning rate of 0.1. The network has been trained for 100,000 learning cycles, which took about 27 hours on the same system described for the symbolic experimentation. A subset of the test set has been used to periodically validate the performance, in order to avoid the overtraining phenomenon; the network reached its best-performing weight configuration in about eight hours.
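For reference, the reported training setup, collected into a hypothetical configuration dictionary (the key names and comments are ours; the actual configuration format of the software may differ):

```python
BPTS_CONFIG = {
    "architecture": "three-layer recursive network",
    "inputs": 3,                 # one per encoded node attribute
    "channels": 4,               # as reported by the authors
    "outputs": 10,               # one per character class
    "state_neurons": 20,         # best of the tested configurations
    "hidden_neurons": 40,
    "algorithm": "Back-Propagation Through Structure (BPTS)",
    "learning_rate": 0.1,
    "learning_cycles": 100_000,  # about 27 hours of training
}
```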

TABLE 1 The Classification Matrices of the Two Methods on the Test Set (Null Values Are Not Printed Out)

The gray columns refer to the neural algorithm, while the white ones to ours. Values on the main diagonal represent the recognition rate of the classes, while the value at row $i$ and column $j$ ($j \neq i$) denotes the percentage of samples of class $i$ erroneously attributed to class $j$.


Fig. 16 presents the recognition rate achieved by the two algorithms and Table 1 the classification matrices. It should be noted that, apart from class A, the symbolic system introduces more confusion classes (e.g., see the columns headed C or F) than the neural one; conversely, the neural algorithm misclassifies a higher number of samples than ours (see columns A, L, R): The average misclassification rate is 19.3 percent for the neural system versus 9.74 percent for the symbolic one. As is evident from the table, the overall performance of the symbolic algorithm is actually better than that of the neural network. However, the reported results do not yet allow us to draw a definitive conclusion about which approach is better. In fact, neglecting the problem of class "H" (which could probably be resolved by tweaking the data representation), the two systems are quite close in their performance. Hence, in order to formulate an answer to our dilemma, we need to examine the differences between the two approaches in more detail than that provided by a simple performance measure. The following sections will discuss these differences from both the theoretical and the experimental points of view.

6 CONNECTIONIST VS. SYMBOLIC: A THEORETICAL COMPARATIVE ANALYSIS

A first difference between our method and the connectionist approaches cited above lies in the formalism adopted for representing information: Our system employs the rich grammar of ARGs which, albeit not as powerful as full first order predicate logic, is able to express almost any kind of relation of practical interest. The connectionist approaches proposed so far, instead, impose some restrictions on the kind of graphs that can be managed: Chiefly, the edges are constrained to be ordered and the node valence has to be bounded, since some sort of correspondence has to be established between the structure of the graphs and the structure of the network which deals with them. As regards the attributes of the graphs, a purely symbolic approach (such as our method) often requires that they be chosen from finite alphabets, thus forcing, in many cases, some form of quantization of the original values. On the other hand, connectionist approaches can easily deal with continuous attribute values. A more important difference is the representation of the generated knowledge about the classes. Class descriptions found by neural networks are implicitly kept in the weights of the neuron interconnections; it is not an easy task to recover from those weights any kind of meaningful information about the classes. Thus, the system is able to perform a classification, but not to explain the criteria that guided its choice. On the contrary, our system generates explicit, declarative knowledge about the classes in the form of GARGs that may be understood with little effort by a human expert: A direct inspection of such explicit prototypes can be sufficient to validate them or possibly to delete the badly formed prototypes that could be detrimental to classification performance. The fact that our learning system constructs the prototypes directly in the space of graphs has another main

VOL. 13,

NO. 2,

MARCH/APRIL 2001

advantage: It is possible to add some background knowledge specific to the particular domain, in order to speed up the learning process by cutting off useless tentative prototypes (so pruning the search space), or to direct the learning process toward the most effective class descriptions. In a connectionist approach, instead, there is no easy way to integrate background knowledge in the learning process, although some methods have been proposed with reference to special kinds of knowledge [14]. It is worth noting that explicit prototypes can be exploited also in the design of a heuristic function suitable for the application domain examined: By following the refinements of a prototype step by step, the domain expert can identify any undesired pruning in the search space due to the considered heuristic function and change it accordingly. After comparing the two approaches with reference to the representation of knowledge and its direct implications, let's turn our attention to the strategies employed. A symbolic learning system is able to guarantee a description both coherent and consistent with respect to the target classes, if such a description exists. On the other hand, neural networks exhibit a number of advantages when the training examples are not close to their ideal form. First of all, neural networks show a robustness with respect to noisy or approximated data that even state-of-the-art symbolic systems do not demonstrate, though some efforts in the symbolic machine learning community have been aimed at improving this aspect [37]. Moreover, in the connectionist approach, the (implicit) class descriptions converge to their final form in a gradual and progressive way; at any moment, the network encodes a more or less approximated version of the problem solution. Because of this strategy, stopping learning will result in less accurate but still usable class descriptions. On the other hand, symbolic systems usually produce the class prototypes sequentially; if the learning process is interrupted before it completes, each output prototype will be as accurate as if learning was finished, but a certain number of samples will remain uncovered with evident detriment to the classification performance. Another important advantage of neural networks with respect to our symbolic approach is the possibility of estimating quantitatively the classification reliability from the network output. The output of a neural network is not a mere binary result; each component of the output vector may be designed to assume a continuous range of values, thus encoding the confidence of the network in its answer. This property can be exploited in the design of a robust reject option [36], which is needed in many reliabilitycritical applications.
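To make the reject option concrete, the following is a minimal sketch of one possible rule built on a continuous output vector; the thresholds and the rule itself are illustrative assumptions on our part, not the reliability methodology of [36]:

    import numpy as np

    def classify_with_reject(output, threshold=0.7, margin=0.2):
        # Toy reject rule on a network's continuous output vector.
        # 'threshold' and 'margin' are arbitrary illustrative values.
        order = np.argsort(output)[::-1]        # classes by decreasing confidence
        best, runner_up = output[order[0]], output[order[1]]
        # Reject when the winner is weakly activated or barely ahead of
        # the runner-up: both cases suggest an unreliable answer.
        if best < threshold or (best - runner_up) < margin:
            return None                         # reject: defer the decision
        return int(order[0])                    # accept: winning class index

    # A confident answer is accepted, an ambiguous one is rejected.
    print(classify_with_reject(np.array([0.05, 0.90, 0.05])))  # -> 1
    print(classify_with_reject(np.array([0.40, 0.45, 0.15])))  # -> None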

6.1 Connectionist vs. Symbolic: An Experimental Comparison

Now, let us examine the behavior of the two algorithms on the selected test application. Given the many intrinsic dissimilarities outlined above, it is useful to ascertain whether the similarity of the classification results implies that the two algorithms behave in the same way on the input samples and are deceived by the same kinds of patterns; this important information cannot be derived from an aggregate measure like the classification table presented above. Thus, we have expressly computed the fraction of test samples that are correctly classified by both systems, by only one of them, and by neither (see Fig. 17). These data confirm what we could reasonably expect from the theoretical analysis of the algorithms: The two systems exhibit a high degree of orthogonality in their errors; that is, many of the samples misrecognized by one of the algorithms are correctly classified by the other. Only relatively few samples (4 percent) are missed by both the symbolic and the neural classifier.

Fig. 17. Coverage of the test set: The percentages report which system recognizes the test samples (clockwise: neither of them, only the symbolic one, only the neural one, both). (a) The coverage computed on the whole test set. (b) The same information for the samples of class A (the only class for which both systems get a poor result; see Fig. 16).

This orthogonality opens the possibility of a cooperative use of the two methods according to a parallel combination scheme, in order to build a system able to achieve a significantly better performance than either method taken singularly. In a parallel combination scheme [34], [35], each classifier is trained separately and all the classifiers perform their task in parallel. The output of the classifiers is then fed into a combiner which, on the basis of a suitably chosen combining rule, decides which class the input sample should be assigned to. By devising a proper combining rule, such a scheme could exploit the different abilities of the two approaches in order to circumvent the weak points of each single algorithm. In the ideal case, the classification error could be cut down to the fraction of samples that neither of the two classifiers is able to recognize, achieving a significant improvement with respect to the best performance of a single approach (the error rate would be cut from 10 percent to 4 percent). Of course, such a performance gain is not easy to achieve: The main problem is the construction of a combiner able to recognize which of the two classifiers provides the more reliable answer on the sample at hand. For the neural network, the reliability can be evaluated with good results using the methodology presented in [36]; we are working on a similar technique for our symbolic algorithm.
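To make the combination machinery concrete, here is a minimal sketch of a parallel combiner under stated assumptions: the classifier interface (a function returning a class label and a reliability in [0, 1]) and the agreement/highest-reliability rule are hypothetical placeholders of ours, not the combiner used in the experiments:

    from typing import Callable, Hashable, Tuple

    # Assumed interface: a classifier maps a sample to (class, reliability).
    Classifier = Callable[[object], Tuple[Hashable, float]]

    def parallel_combine(sample, neural: Classifier, symbolic: Classifier):
        # Toy combining rule: both classifiers run independently; on
        # disagreement, the answer with the higher reliability wins.
        n_cls, n_rel = neural(sample)
        s_cls, s_rel = symbolic(sample)
        if n_cls == s_cls:
            return n_cls                  # agreement: the common answer stands
        return n_cls if n_rel >= s_rel else s_cls

    # Stub classifiers that disagree; the more reliable one prevails.
    neural_stub = lambda s: ("A", 0.9)
    symbolic_stub = lambda s: ("B", 0.6)
    print(parallel_combine("x", neural_stub, symbolic_stub))  # -> A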

7 CONCLUSIONS AND PERSPECTIVES

In this paper, we have presented a novel method for learning structural descriptions from examples, based on a formulation of the learning problem in terms of Attributed Relational Graphs. Like learning methods based on first-order logic, our method produces general prototypes which are easy to understand and to manipulate, but it is based on simpler operations (graph editing and graph matching), leading to a smaller overall computational cost. A preliminary experimental evaluation has been conducted, which seems to confirm our claims about the advantages of our method.

The results obtained by our method have then been compared with those of a connectionist system (a neural network) especially devised for working with structured information, trained and tested on the same data sets. While the overall performances of the two systems are similar, an interesting fact that emerged from the experimentation is that they show a significant amount of orthogonality with respect to their classification errors, in the sense that samples misclassified by one system are often correctly recognized by the other. This orthogonality can be exploited by a suitable combining scheme, using the techniques described in the previous section, in order to obtain a more powerful system which retains the strengths of both its constituents while overcoming their weaknesses; the resulting system would exhibit a classification performance significantly higher than either of the two basic approaches taken singularly.

The synergic integration of the symbolic and connectionist approaches is a promising technique, although many issues must still be resolved in order to uncover its potential advantages in an effective way. The parallel combination scheme previously discussed seems to be the most appropriate and most thoroughly understood form of integration for classification problems. However, other forms of integration may be better suited to other kinds of problems.

We are also starting a preliminary investigation of the theoretical and practical issues regarding the definition and implementation of a more general framework for exploiting the possible synergies between the two approaches. One possibility we are examining entails the construction of a hierarchical system with different levels of structured description, each of which can be managed by a different symbolic or connectionist subsystem. In this case, a characterization of the peculiarities of each level would be used to guide the choice of the best approach for that level. An important factor would be the kind of variability exhibited by the description. For instance, symbolic methods usually select a subset of distinctive features which are incorporated in the prototype definition; hence, they are able to withstand significant variations in the parts that are not considered significant, but will usually show very little robustness with respect to noise affecting the distinctive features. Connectionist methods, on the other hand, are less prone to errors when the samples are affected by noise, but may prove unable to identify unimportant features as such and to ignore them completely. In a hierarchical system, it may even be possible to use a connectionist approach to build or preprocess a description which is then fed to a symbolic system, or vice versa, in such a way that each subsystem filters out the kind of variability it is better suited to cope with. This idea seems very promising, even though its realization still presents some open questions.
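Purely as an illustration of the serial form of integration sketched above, the fragment below shows the shape such a cascade might take; denoise_net and symbolic_matcher are hypothetical components we introduce for the example, not parts of the system described in this paper:

    def cascade_classify(raw_description, denoise_net, symbolic_matcher):
        # Hypothetical cascade: a connectionist stage filters out the kind
        # of noise it copes with best; a symbolic stage then matches the
        # cleaned description against explicit prototypes.
        cleaned = denoise_net(raw_description)
        return symbolic_matcher(cleaned)

    # Stub components standing in for the two subsystems.
    denoise_stub = lambda d: d.strip().lower()
    matcher_stub = lambda d: "A" if d.startswith("a") else "unknown"
    print(cascade_classify("  A-shaped stroke ", denoise_stub, matcher_stub))  # -> A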

REFERENCES

[1] T. Pavlidis, Structural Pattern Recognition. New York: Springer, 1977.
[2] J. Rocha and T. Pavlidis, "A Shape Analysis Model with Applications to a Character Recognition System," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 4, pp. 393-404, Apr. 1994.
[3] H. Nishida, "Shape Recognition by Integrating Structural Descriptions and Geometrical/Statistical Transforms," Computer Vision and Image Understanding, vol. 64, pp. 248-262, 1996.
[4] B.T. Messmer and H. Bunke, "A New Algorithm for Error-Tolerant Subgraph Isomorphism Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 5, pp. 493-504, May 1998.
[5] L.P. Cordella, P. Foggia, R. Genna, and M. Vento, "Prototyping Structural Descriptions: An Inductive Learning Approach," Advances in Pattern Recognition, Lecture Notes in Computer Science, no. 1451, pp. 339-348, 1998.
[6] R.S. Michalski, "Pattern Recognition as Rule-Guided Inductive Inference," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 2, no. 4, pp. 349-361, July 1980.
[7] R.S. Michalski, "A Theory and Methodology of Inductive Learning," Machine Learning: An Artificial Intelligence Approach, R.S. Michalski, J.S. Carbonell, and T.M. Mitchell, eds., vol. 1, chapter 4, pp. 83-133, 1983.
[8] T.G. Dietterich and R.S. Michalski, "A Comparative Review of Selected Methods for Learning from Examples," Machine Learning: An Artificial Intelligence Approach, R.S. Michalski, J.S. Carbonell, and T.M. Mitchell, eds., vol. 1, chapter 3, pp. 41-82, 1983.
[9] S. Muggleton, "Inductive Logic Programming," New Generation Computing, vol. 8, no. 4, pp. 295-318, 1991.
[10] A. Pearce, T. Caelli, and W.F. Bischof, "Rulegraphs for Graph Matching in Pattern Recognition," Pattern Recognition, vol. 27, no. 9, pp. 1231-1247, 1994.
[11] A. Chianese, L.P. Cordella, M. De Santo, and M. Vento, "Decomposition of Ribbon-Like Shapes," Proc. Sixth Scandinavian Conf. Image Analysis, pp. 416-423, 1989.
[12] G.E. Hinton, "Mapping Part-Whole Hierarchies into Connectionist Networks," Artificial Intelligence, vol. 46, pp. 47-75, 1990.
[13] T.A. Plate, "Holographic Reduced Representations," IEEE Trans. Neural Networks, vol. 6, no. 3, pp. 623-641, May 1995.
[14] P. Frasconi, M. Gori, M. Maggini, and G. Soda, "Unified Integration of Explicit Knowledge and Learning by Example in Recurrent Networks," IEEE Trans. Knowledge and Data Eng., vol. 7, no. 2, pp. 340-346, Apr. 1995.
[15] A. Sperduti, "Stability Properties of Labeling Recursive Auto-Associative Memories," IEEE Trans. Neural Networks, vol. 6, no. 6, pp. 1452-1460, Nov. 1995.
[16] A. Sperduti and A. Starita, "Supervised Neural Networks for the Classification of Structures," IEEE Trans. Neural Networks, vol. 8, no. 3, pp. 714-735, May 1997.
[17] P. Frasconi, M. Gori, and A. Sperduti, "A General Framework for Adaptive Processing of Data Structures," IEEE Trans. Neural Networks, vol. 9, no. 5, pp. 768-785, Sept. 1998.
[18] J.B. Pollack, "Recursive Distributed Representations," Artificial Intelligence, vol. 46, pp. 77-106, 1990.
[19] P. Smolensky, "Tensor Product Variable Binding and the Representation of Symbolic Structures in Connectionist Systems," Artificial Intelligence, vol. 46, pp. 159-216, 1990.
[20] D.S. Touretzky, "Dynamic Symbol Structures in a Connectionist Network," Artificial Intelligence, vol. 42, no. 3, pp. 5-46, 1990.
[21] M. Botta, A. Giordana, and L. Saitta, "Learning Fuzzy Concept Definitions," Proc. IEEE Int'l Conf. Fuzzy Systems, pp. 18-22, 1993.
[22] J.R. Quinlan, "Learning Logical Definitions from Relations," Machine Learning, vol. 5, no. 3, pp. 239-266, 1993.
[23] N. Lavrac and S. Dzeroski, Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
[24] T.M. Mitchell, "Generalization as Search," Artificial Intelligence, vol. 18, no. 2, pp. 203-226, 1982.
[25] T.M. Mitchell, "The Need for Biases in Learning Generalizations," Technical Report CMB-TR-117, Dept. of Computer Science, Rutgers Univ., New Brunswick, New Jersey, 1980.
[26] Readings in Machine Learning, J. Shavlik and T. Dietterich, eds., Morgan Kaufmann, 1990.
[27] S.J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995.
[28] P.H. Winston, "Learning Structural Descriptions from Examples," Technical Report MAC-TR-76, Dept. of Electrical Engineering and Computer Science, MIT, 1970.
[29] L.P. Cordella, P. Foggia, C. Sansone, and M. Vento, "Subgraph Transformations for the Inexact Matching of Attributed Relational Graphs," Computing, vol. 12, pp. 43-52, 1998.
[30] J.R. Ullmann, "An Algorithm for Subgraph Isomorphism," J. Assoc. for Computing Machinery, vol. 23, pp. 31-42, 1976.
[31] M. Bongard, Pattern Recognition. Hayden Book Company (Spartan Books), 1970.
[32] L.P. Cordella, P. Foggia, C. Sansone, and M. Vento, "Performance Evaluation of the VF Graph Matching Algorithm," Proc. 10th Int'l Conf. Image Analysis and Processing, 1999.
[33] M. Minsky, "Logical vs. Analogical or Symbolic vs. Connectionist or Neat vs. Scruffy," Artificial Intelligence at MIT: Expanding Frontiers, P.H. Winston, ed., vol. 1, 1990; reprinted in AI Magazine, 1991.
[34] I. Bloch, "Information Combination Operators for Data Fusion: A Comparative Review with Classification," IEEE Trans. Systems, Man, and Cybernetics, Part A, vol. 26, no. 1, pp. 52-76, 1996.
[35] T.K. Ho, J.J. Hull, and S.N. Srihari, "Decision Combination in Multiple Classifier Systems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66-75, Jan. 1994.
[36] L.P. Cordella, C. Sansone, F. Tortorella, M. Vento, and C. De Stefano, "Neural Network Classification Reliability: Problems and Applications," Image Processing and Pattern Recognition, Neural Network Systems Techniques and Applications, vol. 5, pp. 161-200, 1998.
[37] N. Lavrac and S. Dzeroski, Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.


Pasquale Foggia received the Laurea degree with honors in computer engineering in 1995 and the PhD degree in computer engineering in 1999, both from the University of Naples "Federico II." He is currently an assistant professor at the Dipartimento di Informatica e Sistemistica of the University of Naples "Federico II." His research interests are in the fields of classification algorithms, optical character recognition, graph matching, and inductive learning. He is a member of the International Association for Pattern Recognition (IAPR).

Roberto Genna received the Laurea degree with honors in computer engineering from the University of Naples "Federico II" in 1997. He is currently a PhD student at the Dipartimento di Informatica e Sistemistica of the University of Naples "Federico II." His present research interests are in the fields of artificial intelligence, machine learning, inductive learning, inductive logic programming, image analysis, and pattern recognition. He is a member of the International Association for Pattern Recognition (IAPR).


Mario Vento received the Laurea degree (cum laude) in electronic engineering and, in 1988, the PhD degree in electronic and computer engineering, both from the University of Naples "Federico II," Italy. Since 1989, he has been with the Dipartimento di Informatica e Sistemistica in the Faculty of Engineering of the University of Naples, first as an assistant professor and currently as an associate professor of computer science and artificial intelligence. His interests involve basic research in the areas of artificial intelligence, image analysis, pattern recognition, machine learning, and parallel computing in artificial vision. He is especially interested in classification techniques, whether statistical, syntactic, or structural, and has contributed to neural network theory, statistical learning, exact and inexact graph matching, multiexpert classification, and learning methodologies for structural descriptions. He has participated in several projects in the areas of handwritten character recognition, document processing, car plate recognition, signature verification, raster-to-vector conversion of technical drawings, and automatic interpretation of biomedical images. He has authored more than 70 research papers in international journals and conference proceedings. Dr. Vento is a member of the International Association for Pattern Recognition (IAPR) and of the IAPR Technical Committee on "Graph Based Representations" (TC15).
