Learning and deduction in neural networks and logic

Ekaterina Komendantskaya

Department of Mathematics, University College Cork, Ireland
Abstract. We demonstrate the interplay between learning and deductive algorithms in logic and neural networks. In particular, we show how learning algorithms recognised in neurocomputing can be used to build connectionist neural networks which simulate the work of the semantic operator for logic programs with uncertainty and the conventional algorithm of SLD-resolution.

Key words: Logic programs, SLD-resolution, Uncertain Reasoning, Connectionism, Neurocomputing
1 Introduction
Email address: [email protected] (Ekaterina Komendantskaya).
Preprint submitted to Elsevier, 22 October 2006.

“The human mind... infinitely surpasses the powers of any finite machine”. This manifesto, proclaimed by Kurt Gödel in his Gibbs lecture to the American Mathematical Society in 1951, still challenges researchers in many areas of computer science to build a device that will prove to be more powerful than the human brain. As advocated in [1], Gödel specifically claimed that Turing overlooked that the human mind “in its use, is not static but constantly developing”. This particular concern of Gödel found a further development in connectionism and neurocomputing. Connectionism is a movement in the fields of artificial intelligence, cognitive science, neuroscience, psychology and philosophy of mind which hopes to explain human intellectual abilities using artificial neural networks. Neural networks are simplified models of the brain composed of large numbers of units (the analogues of neurons) together with weights that measure the strength of connections between the units. Neural networks demonstrate an ability to
learn such skills as face recognition, reading, and the detection of simple grammatical structure. We will pay special attention to one of the topics of connectionism, namely neuro-symbolic integration, which investigates ways of integrating logic and formal languages with neural networks in order to better understand the essence of symbolic (deductive) and human (creative) reasoning, and to show the interconnections between them. Neurocomputing is defined in [2] as the technological discipline concerned with information processing systems (neural networks) that autonomously develop operational capabilities in adaptive response to an information environment. Moreover, neurocomputing is often seen as an alternative to programmed computing: the latter is based on the notion of some fixed algorithm (a set of rules) which must be performed by a machine; the former does not necessarily require an algorithm or rule development. Note that neurocomputing is in particular concerned with the creation of machines implementing neural networks, among them digital, analog, electronic, optical, electro-optic, acoustic, mechanical, chemical and some other types of neurocomputers; see [2] for further details about physical realisations of neural networks. In this paper we will use methods and achievements of both connectionism and neurocomputing: namely, we take the connectionist neural networks of [3] as a starting point and show how they can be enriched by learning functions and algorithms implemented in neurocomputing. The main question we address is how learning in neural networks and deduction in logic can complement each other. This question splits into two particular questions: Is there something in deductive, automated reasoning that corresponds (or may be brought into correspondence) to learning in neural networks?
Can first-order deductive mechanisms be realized in neural networks, and if they can, will the resulting neural networks be learning or static? First we fix precise meanings of the terms “deduction” and “learning”. Deductive reasoning is inference in which the conclusion is necessitated by, or deduced from, previously known facts or axioms. The notion of learning is often seen as opposite to the notion of deduction. We adopt the definition of learning from neurocomputing, see [4]. Learning is a process by which the free parameters of a neural network are adapted through a continuing process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place. This definition of the learning process implies the following sequence of events: • The neural network is stimulated by an environment. • The neural network undergoes changes as a result of this stimulation.
• The neural network responds in a new way to the environment, because of the changes that have occurred in its internal structure. In this paper we will borrow techniques from both supervised and unsupervised learning paradigms; in particular, we will use such algorithms as error-correction learning (see Section 3.2.1), Hebbian learning (see Section 2.2), competitive learning (see Section 3.2.3) and filter learning (see Section 3.2.2). The field of neuro-symbolic integration is stimulated by the fact that formal theories (as studied in mathematical logic and used in automated reasoning) are commonly recognised as deductive systems which lack such properties of human reasoning as adaptation, learning and self-organisation. On the other hand, neural networks, introduced as a mathematical model of neurons in human brains, claim to possess all of the mentioned abilities; moreover, they provide parallel computations and hence can perform certain calculations faster than classical algorithms. As a step towards integration of the two paradigms, connectionist neural networks were built [3] which can simulate the work of the semantic operator for propositional and (function-free) first-order logic programs. Those neural networks, however, were essentially deductive and could not learn or perform any form of self-organisation or adaptation; they could not even make deduction faster or more effective. There have been several attempts to make these neural networks more “intelligent”; see, for example, [5], [6], [7] for some further developments. In this paper we propose two ways in which the connectionist neural networks of [3] can be brought closer to the original ideas of connectionism and neurocomputing. The first way shows that the connectionist neural networks of [3], enriched so as to simulate logic programs for reasoning with uncertainty, incorporate at least two learning functions and thus possess the ability of unsupervised learning; see Section 2 for further details.
The second way employs SLD-resolution: in particular, we build neural networks which can simulate the work of the SLD-resolution algorithm, and show that these neural networks have several learning functions, and thus perform different types of supervised and unsupervised learning recognised in neurocomputing.
1.1 Logic Programming and Neural Networks
In the following two subsections we briefly overview the background definitions, notions and theorems concerning logic programming, neural networks and neuro-symbolic integration. The notation and the main results we overview in this section will be assumed throughout the paper.
1.1.1 Logic Programs and their Fixpoint Semantics
We fix a first-order language L consisting of constant symbols a1 , a2 , . . ., variables x1 , x2 , . . ., function symbols f1 , f2 , . . ., predicate symbols Q1 , Q2 , . . ., connectives ¬, ∧, ∨ and quantifiers ∀, ∃. We follow the conventional definition of a term and a formula. Namely, a constant symbol is a term, a variable is a term, and if fi is a function symbol and t is a term, then fi (t) is a term. Let Qi be a predicate symbol and t be a term. Then Qi (t) is a formula (also called an atomic formula or an atom). If F1 , F2 are formulae, then ¬F1 , F1 ∧ F2 , F1 ∨ F2 , ∀xF1 and ∃xF1 are formulae. A formula of the form ∀x(A ∨ ¬B1 ∨ . . . ∨ ¬Bn ), where A is an atom and each Bi is either an atom or a negation of an atom, is called a Horn clause. A logic program P is a set of Horn clauses, and it is common to use the notation A ← B1 , . . . , Bn for Horn clauses of the form ∀x(A ∨ ¬B1 ∨ . . . ∨ ¬Bn ); see [8] for further details. If each Bi is positive, then we call the clause definite. A logic program that contains only definite clauses is called a definite logic program.

Example 1 This is a simple example of a definite logic program.
Q1 (a1 ) ←
Q2 (a1 ) ←
Q3 (a1 ) ← Q1 (a1 ), Q2 (a1 )
The Herbrand Universe UP for a logic program P is the set of all ground terms which can be formed out of the symbols appearing in P . The Herbrand Base BP for a logic program P is the set of all ground atoms which can be formed by using predicate symbols appearing in P with terms from UP . Let I be an interpretation of P . Then a Herbrand interpretation I for P is defined as follows: I = {Qi (t1 , . . . , tn ) ∈ BP : Qi (t1 , . . . , tn ) is true with respect to I}. The semantic operator TP was first introduced in [9], see also [8], as follows: TP (I) = {A ∈ BP : A ← B1 , . . . , Bn is a ground instance of a clause in P and {B1 , . . . , Bn } ⊆ I}. The semantic operator provides the Least Herbrand Model characterisation for each definite logic program. Define TP ↑ 0 = ∅; TP ↑ α = TP (TP ↑ (α − 1)), if α is a successor ordinal; and TP ↑ α = ∪{TP ↑ β : β < α}, if α is a limit ordinal. The following theorem establishes the correspondence between the Least Herbrand Model of P and iterations of TP . Theorem 2 [9] Let MP denote the Least Herbrand Model of a definite logic
program P . Then MP = lfp(TP ) = TP ↑ ω. Example 3 The following is the Least Herbrand Model for the logic program P from Example 1: TP ↑ 2 = {Q1 (a1 ), Q2 (a1 ), Q3 (a1 )}.
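The fixpoint iteration of Theorem 2 can be sketched in Python for the program of Example 1 (a sketch only; the encoding of ground atoms as strings is our own illustration):

```python
# Definite program of Example 1, as (head, body) pairs over ground atoms.
program = [
    ("Q1(a1)", []),
    ("Q2(a1)", []),
    ("Q3(a1)", ["Q1(a1)", "Q2(a1)"]),
]

def tp(i):
    """One application of the semantic operator T_P to interpretation i."""
    return {head for head, body in program if set(body) <= i}

# Iterate from the empty interpretation up to the least fixed point.
i, steps = set(), 0
while tp(i) != i:
    i = tp(i)
    steps += 1

assert i == {"Q1(a1)", "Q2(a1)", "Q3(a1)"}
assert steps == 2   # the fixed point is reached at T_P up to 2, as in Example 3
```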
1.1.2 Connectionist Neural Networks
We follow the definitions of a connectionist neural network given in [3]; see also [5] and [10] for further developments of the connectionist neural networks. A connectionist network is a directed graph. A unit k in this graph is characterised, at time t, by its input vector (vi1 (t), . . . , vin (t)), its potential pk (t), its threshold Θk , and its value vk (t). Note that in general all vi , pi and Θi , as well as all other parameters of a neural network, can be represented by different types of data, the most common of which are real numbers, rational numbers [3], fuzzy (real) numbers, complex numbers, floating-point numbers, and some others; see [2] for more details. In the two major sections of this paper we use rational numbers (Section 2) and Gödel (integer) numbers (Section 3). Units are connected via a set of directed and weighted connections. If there is a connection from unit j to unit k, then wkj denotes the weight associated with this connection, and wkj vj (t) is the input received by k from j at time t. The units are updated synchronously. In each update, the potential and value of a unit are computed with respect to an activation and an output function respectively. Most units considered in this paper compute their potential as the weighted sum of their inputs minus their threshold:
pk (t) = Σ_{j=1}^{nk} wkj vj (t) − Θk .
After each synchronous update, time becomes t + ∆t, and the output value for k, vk (t + ∆t), is calculated from pk (t) by means of a given output function F , that is, vk (t + ∆t) = F (pk (t)). For example, the output function we most often use in this paper is the binary threshold function H, that is, vk (t + ∆t) = H(pk (t)), where H(pk (t)) = 1 if pk (t) > 0 and 0 otherwise. Units of this type are called binary threshold units. Example 4 Consider two units, j and k, having thresholds Θj , Θk , potentials pj , pk and values vj , vk . The weight of the connection between units j and k is denoted wkj . Then the following graph shows a simple neural network consisting of j and k. The neural network receives input signals v 0 , v 00 , v 000 and sends an output signal vk .
[Figure: a simple two-unit network. The input signals v 0 , v 00 , v 000 feed unit j (threshold Θj , potential pj ), which is connected via weight wkj to unit k (threshold Θk , potential pk ); unit k emits the output value vk .]
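The update rule of a binary threshold unit can be sketched as follows (the particular weights and threshold used here are illustrative, not taken from the paper):

```python
def unit_output(weights, inputs, theta):
    """Binary threshold unit: potential p = sum_j w_kj * v_j - Theta_k,
    output H(p) = 1 if p > 0, else 0."""
    p = sum(w * v for w, v in zip(weights, inputs)) - theta
    return 1 if p > 0 else 0

# A unit with three weight-1 input connections and threshold 0.5
# fires as soon as at least one of its inputs is active:
assert unit_output([1, 1, 1], [1, 0, 0], 0.5) == 1
assert unit_output([1, 1, 1], [0, 0, 0], 0.5) == 0
```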
We will mainly consider connectionist networks where the units can be organised in layers. A layer is a vector of units. An n-layer feedforward network F consists of the input layer, n − 2 hidden layers, and the output layer, where n ≥ 2. Each unit occurring in the i-th layer is connected to each unit occurring in the (i + 1)-st layer, 1 ≤ i < n. The next theorem establishes a correspondence between the semantic operator TP defined for definite logic programs and three-layer connectionist neural networks. Theorem 5 [3] For each definite propositional logic program P , there exists a 3-layer feedforward network computing TP . PROOF. The core of the proof is the translation algorithm from a logic program P into a corresponding neural network, which can be briefly described as follows. • The input and output layers are vectors of binary threshold units of length m, where the i-th unit in the input and output layer represents the i-th proposition. All units in the input and output layers have thresholds equal to 0.5. • For each clause A ← B1 , . . . , Bn , do the following. · Add a binary threshold unit c to the hidden layer. · Connect c to the unit representing A in the output layer with weight 1. · For each formula Bj , 1 ≤ j ≤ n, connect the unit representing Bj to c with weight 1. · Set the threshold Θc of c to l − 0.5, where l is the number of atoms occurring in B1 , . . . , Bn . • For each proposition A, connect the unit representing A in the output layer with the unit representing A in the input layer via a connection with weight 1. □ This theorem easily generalises to the case of function-free first-order logic programs, because the Herbrand Base for these logic programs is finite and hence it is possible to work with a finite number of ground atoms instead of propositions.
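The translation algorithm of Theorem 5 can be sketched for the program of Example 1; this is a minimal simulation which assumes synchronous updates and folds the weight-1 hidden-to-output and output-to-input connections into a single step function:

```python
# Program of Example 1, translated clause by clause into hidden units.
atoms = ["Q1(a1)", "Q2(a1)", "Q3(a1)"]
clauses = [("Q1(a1)", []), ("Q2(a1)", []),
           ("Q3(a1)", ["Q1(a1)", "Q2(a1)"])]

def step(state):
    """One pass input -> hidden -> output through the clause network."""
    out = {a: 0 for a in atoms}
    for head, body in clauses:
        # Hidden unit for the clause: threshold l - 0.5, where l = len(body),
        # with weight-1 connections from the units for the body atoms.
        p = sum(state[b] for b in body) - (len(body) - 0.5)
        if p > 0:
            out[head] = 1   # weight-1 connection to the output unit for head
    return out

state = {a: 0 for a in atoms}
for _ in range(2):          # two iterations, as in Example 6
    state = step(state)

assert state == {"Q1(a1)": 1, "Q2(a1)": 1, "Q3(a1)": 1}
```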
However, if we wish to use the construction of Theorem 5 to compute the semantic operator for conventional first-order logic programs, we will need to use some kind of approximation theorem as in [10], [11] or [12] to give an account of the cases when the Herbrand Base is infinite, and hence an infinite number of neurons is needed to simulate the semantic operator for the logic programs. An essential property of the neural networks constructed in the proof of Theorem 5 is that the number of iterations of TP (for a given logic program P ) corresponds to the number of iterations of the neural network built upon P . Example 6 Consider the logic program from Example 1. The following neural network computes TP ↑ 2 in two iterations.
[Figure: a three-layer network for the program of Example 1. The input and output layers each contain units for Q1 (a1 ), Q2 (a1 ), Q3 (a1 ) with thresholds 0.5; two hidden units with threshold −0.5 represent the two unit clauses, and a hidden unit with threshold 1.5 represents the clause Q3 (a1 ) ← Q1 (a1 ), Q2 (a1 ).]

There are several immediate conclusions to be made from this introductory section. We have described a procedure by which the Least Herbrand model of a function-free logic program can be computed by a three-layer feedforward connectionist neural network. However, these neural networks do not possess some of the essential and most beneficial properties of artificial neural networks, such as learning, self-adaptation, and the ability to perform parallel computations. On the contrary, the connectionist neural networks we have described are essentially a deductive procedure, which works no faster than conventional resolution for logic programs. Moreover, the connectionist neural networks depend on ground instances of clauses, and in the case of logic programs containing function symbols will require infinitely long layers to compute the least fixed point of TP . This property does not agree with the very idea of neurocomputing, which advocates another principle of computation: the effectiveness of both natural and artificial neural networks depends primarily on their architecture, which is finite, but allows very sophisticated and “well-trained” interconnections between neurons. (Note that in human brains the overall number of neurons constantly decreases during the lifetime, but the neurons which remain active are able to increase and optimise interconnections with other neurons; see [4] for more details.) The approximation approach supports the opposite direction of increasing the number of units in the artificial neural network (in order to approximate its infinite counterpart) rather than of improving connections
within a given neural network.
Recall that the original goal of connectionism was to combine the most beneficial properties of automated reasoning and neural networks in order to build more effective and “wiser” algorithms. The natural question is: why do the connectionist neural networks of [3] not possess the essential properties of conventional neural networks, and in what ways can they be improved so as to bring learning, self-organisation and parallelism into their computations? There may be two alternative answers to the question.
1. The connectionist neural networks we have described simulate a classical two-valued deductive system, and they inherit the purely deductive way of processing a given database from classical logic. If this is the case, we can adapt the algorithm of building a connectionist neural network to non-classical logic programs; different kinds of logic programs for reasoning with uncertainties and incomplete/inconsistent databases are natural candidates for this. In Section 2 we show that connectionist neural networks computing the Least Herbrand Model for a given bilattice-based annotated logic program perform two kinds of unsupervised learning: Hebbian and Anti-Hebbian learning. These neural networks, however, still require approximation theorems when working with logic programs containing function symbols.
This problem led us to yet another answer:
2. It may be the choice of the TP operator that led us in the wrong direction. What if we try to simulate classical SLD-resolution by connectionist neural networks? The main benefit is that we no longer need to work with infinite neural network architectures. We introduce an algorithm for constructing neural networks simulating the work of SLD-resolution in Section 3. In fact, the neural networks of this type possess many more useful properties than just a finite architecture. In particular, we show that these neural networks incorporate six learning functions which are recognised in neurocomputing, and can perform parallel computations for certain types of program goals.
We conclude the paper with a discussion of the benefits of the two approaches and of the possibilities of building neural networks simulating SLD-resolution for logic programs with uncertainties.
2 Reasoning with Uncertainty and Neural Computations

2.1 Bilattice-based Logic Programs
The notion of a bilattice was introduced in the 1980s as a generalisation of Belnap’s famous lattice (NONE ≤ TRUE ≤ BOTH; NONE ≤ FALSE ≤ BOTH), a suitable structure for interpreting different languages and programs working with uncertainties and incomplete or inconsistent databases; see [13], [14], [15], [16], [17] for further details and motivation. Definition 7 [13] A bilattice B is a sextuple (B, ∨, ∧, ⊕, ⊗, ¬) such that (B, ∨, ∧) and (B, ⊕, ⊗) are both complete lattices, and ¬ : B → B is a mapping satisfying the following three properties: ¬2 = IdB , ¬ is a dual lattice homomorphism from (B, ∨, ∧) to (B, ∧, ∨), and ¬ is a lattice homomorphism from (B, ⊕, ⊗) to itself. Note that the lattice (B, ∨, ∧) is traditionally thought of as generalising the Boolean lattice (FALSE, TRUE), describing measures of truth and falsity. The lattice (B, ⊕, ⊗) is thought of as measuring the amount of information (or knowledge) between NONE and BOTH. We use here the fact that each distributive bilattice can be regarded as a product of two lattices, see [15]. Therefore, we consider only logic programs over distributive bilattices and regard the underlying bilattice of any program as a product of two lattices. Moreover, we always treat each bilattice we work with as isomorphic to some subset of B = L1 × L2 = ([0, 1], ≤) × ([0, 1], ≤), where [0, 1] is the unit interval of reals with the linear ordering defined on it. Elements of such a bilattice are pairs: the first element of each pair denotes evidence for a fact, and the second element denotes evidence against it. Thus, (1, 0) and (0, 1) are the analogues of TRUE and FALSE and are, respectively, maximal and minimal in the truth ordering, while (1, 1) (or BOTH) and (0, 0) (or NONE) are, respectively, maximal and minimal elements in the knowledge ordering.
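For the product bilattice B = [0, 1] × [0, 1], the two orderings and the operations of Definition 7 can be sketched as follows (a sketch; the pointwise operator definitions are the standard ones for product bilattices and are our own rendering):

```python
# Product bilattice B = [0,1] x [0,1]: (evidence for, evidence against).
NONE, TRUE, FALSE, BOTH = (0, 0), (1, 0), (0, 1), (1, 1)

def join_t(p, q):  # truth join: more evidence for, less against
    return (max(p[0], q[0]), min(p[1], q[1]))
def meet_t(p, q):  # truth meet
    return (min(p[0], q[0]), max(p[1], q[1]))
def join_k(p, q):  # knowledge join (the "+" connective)
    return (max(p[0], q[0]), max(p[1], q[1]))
def meet_k(p, q):  # knowledge meet (the "x" connective)
    return (min(p[0], q[0]), min(p[1], q[1]))
def neg(p):        # negation swaps evidence for and against
    return (p[1], p[0])

assert join_t(TRUE, FALSE) == TRUE
assert join_k(TRUE, FALSE) == BOTH   # conflicting sources combine to BOTH
assert meet_k(TRUE, FALSE) == NONE
assert neg(TRUE) == FALSE and neg(BOTH) == BOTH
```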
We define an annotated bilattice-based language L to consist of individual variables, constants, functions and predicate symbols as described in Section 1.1.1, together with annotation terms which can consist of variables, constants and/or functions over a bilattice. Bilattice-based languages allow, in general, six connectives and four quantifiers, as follows: ⊕, ⊗, ∨, ∧, ¬, ∼, Σ, Π, ∃, ∀. But in this paper we restrict our attention to bilattice-based logic programs and will work explicitly only with ⊕, ⊗, Σ, the latter being the existential quantifier with respect to the knowledge ordering.
An annotated formula is defined inductively as follows: if R is an n-ary predicate symbol, t1 , . . . , tn are terms, and (µ, ν) is an annotation term, then R(t1 , . . . , tn ) : (µ, ν) is an annotated formula (called an annotated atom). Annotated atoms can be combined to form complex formulae using the connectives and quantifiers. A bilattice-based annotated logic program (BAP) P consists of a finite set of annotated program clauses of the form A : (µ, ν) ← L1 : (µ1 , ν1 ), . . . , Ln : (µn , νn ), where A : (µ, ν) denotes an annotated atom called the head of the clause, and L1 : (µ1 , ν1 ), . . . , Ln : (µn , νn ) denotes L1 : (µ1 , ν1 ) ⊗ . . . ⊗ Ln : (µn , νn ) and is called the body of the clause; each Li : (µi , νi ) is an annotated literal called an annotated body literal of the clause. Individual and annotation variables in the body are thought of as being existentially quantified using Σ. In [18], we showed how the remaining connectives ⊕, ∨, ∧ can be introduced into BAPs. Each annotated atom A : (µ, ν) is interpreted in two steps as follows: the first-order atomic formula A is interpreted in B (we write IB (A) for the value A receives in B) using the domain of interpretation and variable assignment; see [15], [14], [17], [18] for further details. Then we define the interpretation IB as follows: if IB (A) ≥ (µ, ν), we put IB (A : (µ, ν)) = (1, 0); and IB (A : (µ, ν)) = (0, 1) otherwise. Let IB be an interpretation for L and let F be a closed annotated formula of L. Then IB is a model for F if IB (F ) = (1, 0). We say that IB is a model for a set S of annotated formulae if IB is a model for each annotated formula of S. We say that F is a logical consequence of S if, for every interpretation IB of L, IB is a model for S implies IB is a model for F .
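The two-step interpretation of an annotated atom can be sketched as follows, assuming that ≥ in the definition above is the knowledge ordering ≥k (our reading); the numeric values are illustrative:

```python
def leq_k(p, q):
    """Knowledge ordering on annotations: p <=_k q, componentwise."""
    return p[0] <= q[0] and p[1] <= q[1]

def interpret_annotated(ib_value, annotation):
    """I_B(A : (mu, nu)) = (1, 0) iff I_B(A) >=_k (mu, nu), else (0, 1)."""
    return (1, 0) if leq_k(annotation, ib_value) else (0, 1)

# If A is interpreted as (0.7, 0.2), then A : (0.5, 0.1) holds,
# while A : (0.9, 0.1) claims more evidence for A than is available:
assert interpret_annotated((0.7, 0.2), (0.5, 0.1)) == (1, 0)
assert interpret_annotated((0.7, 0.2), (0.9, 0.1)) == (0, 1)
```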
Let BP and UP denote, respectively, an annotation Herbrand base and an annotation Herbrand universe for a program P ; these are essentially BP and UP as defined in Section 1.1.1, but with annotation terms allowed in UP and attached to ground formulae in BP . In common with conventional logic programming, each Herbrand interpretation HI for P can be identified with the subset {R(t1 , . . . , tk ) : (α, β) ∈ BP | R(t1 , . . . , tk ) : (α, β) receives the value (1, 0) with respect to IB } of BP it determines, where R(t1 , . . . , tk ) : (α, β) denotes a typical element of BP . This set constitutes an annotation Herbrand model for P . Finally, we let HIP,B denote the set of all annotation Herbrand interpretations for P . It was noticed in [16], [19], [18], [20], [21] that the non-linear ordering of (bi)lattices influences both model properties and proof procedures for (bi)lattice-based logics, and this distinguishes them from classical and even fuzzy logic. In particular, both the semantic operator and SLD-resolution for BAPs must reflect the non-linear ordering of bilattices; see [18], [21], [22].
In [18], we introduced a semantic operator TP for BAPs, proved its continuity and showed that it computes at its least fixed point the least Herbrand model for a given BAP: Definition 8 We define the mapping TP : HIP,B → HIP,B as follows: TP (HI) denotes the set of all A : (µ, ν) ∈ BP such that either (1) there is a strictly ground instance of a clause A : (µ, ν) ← L1 : (µ1 , ν1 ), . . . , Ln : (µn , νn ), such that {L1 : (µ01 , ν10 ), . . . , Ln : (µ0n , νn0 )} ⊆ HI for some annotations (µ01 , ν10 ), . . . , (µ0n , νn0 ), and one of the following conditions holds for each (µ0i , νi0 ): (a) (µ0i , νi0 ) ≥k (µi , νi ), (b) (µ0i , νi0 ) ≥k ⊗j∈Ji (µj , νj ), where Ji is the finite set of those indices j ∈ {1, . . . , n} such that Lj = Li ; or (2) there are annotated strictly ground atoms A : (µ∗1 , ν1∗ ), . . . , A : (µ∗k , νk∗ ) ∈ HI such that (µ, ν) ≤k (µ∗1 , ν1∗ ) ⊕ . . . ⊕ (µ∗k , νk∗ ). Item 1a is the analogue of the TP operator defined in Section 1.1.1. Items 1b and 2 reflect properties of the bilattice structure, as further illustrated in the next example. Example 9 Consider a bilattice-based annotated logic program P which can collect and process information about connectivity of some (probabilistic) graph G. Suppose we have received information from two different sources: one reports that there is an edge between nodes a and b, the other reports that there is no such edge. This is represented by the two unit clauses edge(a, b) : (1, 0) ←, edge(a, b) : (0, 1) ←. It is reasonable to conclude that the information is contradictory, that is, to conclude that edge(a, b) : (1, 1). And this is captured in item 2. If, on the other hand, the program contains some clause of the form disconnected(G) : (1, 1) ← connected(a, c) : (1, 0), connected(a, c) : (0, 1), we may regard the clause disconnected(G) : (1, 1) ← connected(a, c) : (0, 0) as equally true. And this is captured in item 1b. Let B, A and C denote, respectively, edge(a, b), connected(a, c) and disconnected(G).
Consider the logic program: B : (0, 1) ←, B : (1, 0) ←, A : (0, 0) ← B : (1, 1), C : (1, 1) ← A : (1, 0), A : (0, 1). The least fixed point of TP is TP ↑ 3 = {B : (0, 1), B : (1, 0), B : (1, 1), A : (0, 0), C : (1, 1)}. However, item 1a (corresponding to the classical semantic operator) would allow us to compute only TP ↑ 1 = {B : (0, 1), B : (1, 0)}, that is, only explicit consequences of the program, which then leads to a contradiction in the two-valued case. Note that whenever F : (µ, ν) ∈ HI and (µ0 , ν 0 ) ≤k (µ, ν), then F : (µ0 , ν 0 ) ∈ HI. Also, for each formula F , F : (0, 0) ∈ HI.
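A simplified sketch of the operator of Definition 8, run on the program above: items 1a/1b/2 are implemented literally over the finite annotation set {0, 1} × {0, 1}, and details such as the treatment of the trivial annotations F : (0, 0) follow our own reading, so the iteration counts may differ from the text, although the least fixed point does contain all of the atoms listed:

```python
def leq_k(p, q):                       # knowledge ordering, componentwise
    return p[0] <= q[0] and p[1] <= q[1]
def otimes(p, q):                      # knowledge meet
    return (min(p[0], q[0]), min(p[1], q[1]))
def oplus(p, q):                       # knowledge join
    return (max(p[0], q[0]), max(p[1], q[1]))

# Program of the example: (head_atom, head_annotation, body literals).
program = [
    ("B", (0, 1), []),
    ("B", (1, 0), []),
    ("A", (0, 0), [("B", (1, 1))]),
    ("C", (1, 1), [("A", (1, 0)), ("A", (0, 1))]),
]
ANNS = [(a, b) for a in (0, 1) for b in (0, 1)]

def tp(hi):
    out = set()
    # Items 1a / 1b: a clause fires if every body literal is supported in hi,
    # either at its own annotation (1a) or at the knowledge-meet of the
    # annotations of body literals sharing the same atom (1b).
    for head, hann, body in program:
        def supported(lit):
            atom, ann = lit
            joint = ann
            for a2, ann2 in body:      # meet over same-atom body literals
                if a2 == atom:
                    joint = otimes(joint, ann2)
            return any(a == atom and (leq_k(ann, p) or leq_k(joint, p))
                       for a, p in hi)
        if all(supported(lit) for lit in body):
            out.add((head, hann))
    # Item 2: any annotation below the knowledge-join of annotations
    # already derived for the same atom.
    for atom in {a for a, _ in hi}:
        sup = (0, 0)
        for a, p in hi:
            if a == atom:
                sup = oplus(sup, p)
        out.update((atom, cand) for cand in ANNS if leq_k(cand, sup))
    return out | hi                    # accumulate for the fixpoint iteration

hi = set()
while tp(hi) != hi:
    hi = tp(hi)

expected = {("B", (0, 1)), ("B", (1, 0)), ("B", (1, 1)),
            ("A", (0, 0)), ("C", (1, 1))}
assert expected <= hi
```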
2.2 Neural Networks for Reasoning with Uncertainty
We extend the approach of [3] described in Section 1.1.2 to learning neural networks which can compute logical consequences of BAPs. This will allow us to introduce hypothetical and uncertain reasoning into the framework of neural-symbolic computation. Bilattice-based logic programs can work with conflicting sources of information and inconsistent databases. Therefore, neural networks corresponding to these logic programs should reflect this facility as well, and this is why we introduce some forms of learning into the neural networks. These forms of learning can be seen as corresponding to unsupervised Hebbian learning, which is widely implemented in neurocomputing. The general idea behind Hebbian learning is that positively correlated activities of two neurons strengthen the weight of the connection between them, and that uncorrelated or negatively correlated activities weaken the weight of the connection (the latter form is known as Anti-Hebbian learning). The general conventional definition of Hebbian learning is given as follows; see [4] for further details. Let k and j denote two units and wkj denote the weight of the connection from j to k. We denote the value of j at time t by vj (t) and the potential of k at time t by pk (t). Then the rate of change in the weight between j and k is expressed in the form ∆wkj (t) = F (vj (t), pk (t)), where F is some function. As a special case of this formula, it is common to write ∆wkj (t) = η(vj (t))(pk (t)), where η is a constant that determines the rate of learning and is positive in the case of Hebbian learning and negative in the case of Anti-Hebbian learning. Finally, we update wkj (t + 1) = wkj (t) + ∆wkj (t). Example 10 Consider the neural network from Example 4. The Hebbian learning functions described above can be used to train the weight wkj . In this section, we will compare the two learning functions we introduce with this conventional definition of Hebbian learning.
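The conventional Hebbian update ∆wkj (t) = η vj (t) pk (t) can be sketched as follows (the numeric values are illustrative):

```python
def hebbian_update(w, v_j, p_k, eta):
    """w_kj(t+1) = w_kj(t) + eta * v_j(t) * p_k(t).
    eta > 0 gives Hebbian, eta < 0 gives Anti-Hebbian learning."""
    return w + eta * v_j * p_k

w = 0.5
w = hebbian_update(w, v_j=1.0, p_k=0.8, eta=0.1)    # correlated activity
assert abs(w - 0.58) < 1e-9                          # weight strengthened
w = hebbian_update(w, v_j=1.0, p_k=0.8, eta=-0.1)   # Anti-Hebbian step
assert abs(w - 0.5) < 1e-9                           # weight weakened back
```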
First, we prove a theorem establishing a relationship between learning neural networks and BAPs with no function symbols occurring in either individual or annotation terms. (Since the annotation Herbrand base for these programs is finite, they can equivalently be seen as propositional bilattice-based logic programs with no functions allowed in the annotations.) In the next subsection, we will extend the result to first-order BAPs with functions in individual and annotation terms. Theorem 11 For each function-free BAP P , there exists a 3-layer feedforward learning neural network which computes TP .
PROOF. Let m and n be the number of strictly ground annotated atoms from the annotation Herbrand base BP and the number of clauses occurring in P respectively. Without loss of generality, we may assume that the annotated atoms are ordered. The network associated with P can now be constructed by the following translation algorithm. (1) The input and output layers are vectors of binary threshold units of length k, 1 ≤ k ≤ m, where the i-th unit in the input and output layers represents the i-th strictly ground annotated atom. The threshold of each unit occurring in the input or output layer is set to 0.5. (2) For each clause of the form A : (α, β) ← B1 : (α1 , β1 ), . . . , Bm : (αm , βm ), m ≥ 0, in P do the following. 2.1 Add a binary threshold unit c to the hidden layer. 2.2 Connect c to the unit representing A : (α, β) in the output layer with weight 1. We will call connections of this type 1-connections. 2.3 For each atom Bj : (αj , βj ) in the input layer, connect the unit representing Bj : (αj , βj ) to c and set the weight to 1. (We will call these connections 1-connections also.) 2.4 Set the threshold θc of c to l − 0.5, where l is the number of atoms in B1 : (α1 , β1 ), . . . , Bm : (αm , βm ). 2.5 If some input unit representing B : (α, β) is connected to a hidden unit c, connect each of the input units representing annotated atoms B : (αi , βi ), . . . , B : (αj , βj ) to c. These connections will be called ⊗-connections. The weights of these connections will depend on a learning function. If the function is inactive, set the weight of each ⊗-connection to 0. (3) If there are units representing atoms of the form B : (αi , βi ), . . . , B : (αj , βj ) in input and output layers, correlate them as follows. For each B : (αi , βi ), connect the unit representing B : (αi , βi ) in the input layer to each of the units representing B : (αi , βi ), . . . , B : (αj , βj ) in the output layer. These connections will be called the ⊕-connections. 
If an ⊕-connection is set between two atoms with different annotations, we consider them as being connected via hidden units with thresholds 0. If an ⊕-connection is set between input and output units representing the same annotated atom B : (α, β), we set the threshold of the hidden unit connecting them to −0.5, and we call these ⊕-hidden units, so as to distinguish the hidden units of this type. The weights of all these ⊕-connections will depend on a learning function. If the function is inactive, set the weight of each ⊕-connection to 0. (4) Set all the weights which are not covered by these rules to 0. For each annotated atom A : (α, β), connect the unit representing A : (α, β) in the output layer with the unit representing it in the input layer via weight 1. Allow two learning functions to be embedded into the ⊗-connections and the ⊕-connections. We let vi denote the value of the input unit representing
B : (αi, βi) and pc denote the potential of the unit c. Let the unit representing B : (αi, βi) in the input layer be denoted by i. If i is connected to a hidden unit c via an ⊗-connection, then a learning function φ1 is associated with this connection. We let φ1 = ∆wci(t − 1) = (vi(t − 1))(−pc(t − 1) + 0.5) become active and change the weight of the ⊗-connection from i to c at time t if i became activated at time (t − 1), units representing atoms B : (αj, βj), . . . , B : (αk, βk) in the input layer are connected to c via 1-connections, and (αi, βi) ≥k ((αj, βj) ⊗ . . . ⊗ (αk, βk)).
The function φ2 is embedded only into connections of type ⊕, namely, into ⊕-connections between the hidden and output layers. Let o be an output unit representing an annotated atom B : (αi, βi). Apply φ2 = ∆woc(t − 2) = (vc(t − 2))(po(t − 2) + 1.5) to change woc at time t if φ2 is embedded into an ⊕-connection from the ⊕-hidden unit c to o, there are output units representing annotated atoms B : (αj, βj), . . . , B : (αk, βk) which are connected to the unit o via ⊕-connections, these output units became activated at time t − 2, and (αi, βi) ≤k ((αj, βj) ⊕ . . . ⊕ (αk, βk)).
Each annotation Herbrand interpretation HI for P can be represented by a binary vector (v1, . . . , vm). Such an interpretation is given as an input to the network by externally activating the corresponding units of the input layer at time t0. It remains to show that A : (α, β) ∈ TP ↑ n for some n if and only if the unit representing A : (α, β) becomes active at time t + 2, for some t. The proof proceeds by routine induction; see Appendix A. □
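As an illustration of steps (1)-(2) of the construction, the following sketch builds the binary-threshold network for a plain (unannotated) program and iterates it, ignoring the ⊗- and ⊕-connections of steps (2.5) and (3). All names in the code are illustrative, not taken from the construction.

```python
# Sketch of steps (1)-(2): each clause gets a hidden unit with threshold
# l - 0.5 (l = number of body atoms) and weight-1 connections; output units
# have threshold 0.5. Iterating the forward pass computes T_P upward.

def step(x, theta):
    """Binary threshold unit: fires iff its potential exceeds the threshold."""
    return 1 if x > theta else 0

def tp_network(clauses, atoms):
    """Return the forward pass of the network as a map on interpretations
    (dicts atom -> 0/1)."""
    def forward(interp):
        out = {a: 0 for a in atoms}
        for head, body in clauses:
            potential = sum(interp[b] for b in body)     # weight-1 inputs
            hidden = step(potential, len(body) - 0.5)    # clause unit
            if step(hidden, 0.5):                        # output unit
                out[head] = 1
        return out
    return forward

# P: A <- B, C;  B <- ;  C <- B.
clauses = [("A", ["B", "C"]), ("B", []), ("C", ["B"])]
atoms = ["A", "B", "C"]
f = tp_network(clauses, atoms)
i0 = {a: 0 for a in atoms}
i1 = f(i0)   # first iteration: the facts
i2 = f(i1)
i3 = f(i2)   # least fixed point of T_P for this program
```

Three forward passes reach the least fixed point {A, B, C}, mirroring TP ↑ 3 for this small program.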
Example 12 The following diagram displays the neural network which computes TP ↑ 3 from Example 9. Without the functions φ1, φ2 the neural network would compute only TP ↑ 1 = {B : (0, 1), B : (1, 0)}, the explicit logical consequences of the program; it is the use of φ1 and φ2 that allows the neural network to compute TP ↑ 3. Three different styles of arrow denote, respectively, 1-connections, ⊗-connections and ⊕-connections, and we have marked by φ1, φ2 the connections which are activated by the learning functions. According to the conventional definition of feedforward neural networks, each output unit denoting some atom is in its turn connected to the input unit denoting the same atom via a 1-connection, and this forms a loop. We assume but do not draw these connections here.
[Figure: the learning neural network of Example 12. The input and output layers each contain twelve binary threshold units with threshold 0.5, labelled A : (1, 1), A : (1, 0), A : (0, 1), A : (0, 0), B : (1, 1), B : (1, 0), B : (0, 1), B : (0, 0), C : (1, 1), C : (1, 0), C : (0, 1), C : (0, 0). The hidden layer contains clause units with thresholds 0.5 and 1.5 and ⊕-hidden units with threshold −0.5; the connections carrying the learning functions are marked φ1 and φ2.]
We can draw several conclusions from the construction of Theorem 11.
• Neurons representing annotated atoms with identical first-order (or propositional) components are joined into multineurons, in which units are correlated using ⊕- and ⊗-connections.
• The learning function φ2 roughly corresponds to Hebbian learning with the rate of learning η2 = 1, and the learning function φ1 corresponds to anti-Hebbian learning with the rate of learning η1 = −1; we regard η1 as negative because the factor pc in the formula for φ1 is multiplied by (−1).
• The main problem with Hebbian learning is that the weights of connections with embedded learning functions tend to grow exponentially, which does not fit the model of biological neurons. This is why bounding functions are traditionally introduced to restrain the growth. In the neural networks we have built, some of the weights may grow with iterations, but the growth will be very slow because of the activation functions, namely the binary threshold functions, used in the computation of each vi.
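The last observation can be illustrated with a toy computation (illustrative names, not part of the construction): with activations clamped to {0, 1} by binary threshold units, a Hebbian increment adds at most η per step, whereas an activation that grows with the weight makes the same rule explode geometrically.

```python
# Toy comparison: Hebbian update Delta w = eta * v_pre * v_post.
# With binary activations the weight grows linearly; with an unbounded
# activation proportional to the weight it doubles at every step.
eta = 1.0
w_binary = w_unbounded = 1.0
for _ in range(10):
    w_binary += eta * 1 * 1               # both activations clamped to 1
    w_unbounded += eta * w_unbounded * 1  # activation v_pre = w, v_post = 1
```

After ten steps the clamped weight has grown by only 10, while the unbounded one has grown by a factor of 2^10.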
2.3
Neural Networks and First-Order BAPs
In this section we briefly describe the essence of the approximation result of [11], applied here to BAPs. Since the neural networks of [3] were proven to compute the least fixed points of the semantic operator defined for propositional logic programs, many attempts have been made to extend this result to first-order logic programs; see, for example, [10], [11]. We extend here the result obtained by Seda [11] for two-valued first-order logic programs to first-order BAPs. Let l : BP → N be a level mapping with the property that, given n ∈ N, we can effectively find the set of all A : (µ, ν) ∈ BP such that l(A : (µ, ν)) = n.
The following definition is due to Fitting; for further explanations see [11] or [18]. Definition 13 Let HIP,B be the set of all interpretations BP → B. We define the ultrametric d : HIP,B × HIP,B → R as follows: if HI1 = HI2, we set d(HI1, HI2) = 0, and if HI1 ≠ HI2, we set d(HI1, HI2) = 2−N, where N is such that HI1 and HI2 differ on some ground atom of level N and agree on all atoms of level less than N. Fix an interpretation HI mapping elements of the Herbrand base of a given program P to the set of values {(1, 0), (0, 1)}. We assume further that (1, 0) is encoded by 1 and (0, 1) is encoded by 0. Let HIP denote the set of all such interpretations, and take the semantic operator TP as in Definition 8. Let F denote a 3-layer feedforward learning neural network with m units in the input and output layers. The input-output mapping fF is a mapping fF : HIP,B → HIP,B defined as follows. Given HI ∈ HIP,B, we present the vector (HI(B1 : (α1, β1)), . . . , HI(Bm : (αm, βm))) to the input layer; after propagation through the network, we determine fF(HI) by taking the value of fF(HI)(Aj : (αj, βj)) to be the value in the j-th unit in the output layer, j = 1, . . . , m, and taking all other values of fF(HI)(Aj : (αj, βj)) to be 0. Suppose that M is a fixed point of TP. Following [11], we say that a family F = {Fi : i ∈ I} of 3-layer feedforward learning networks Fi computes M if there exists HI ∈ HIP such that the following holds: given any ε > 0, there is an index i ∈ I and a natural number mi such that for all m ≥ mi we have d(fim(HI), M) < ε, where fi denotes fFi and fim(HI) denotes the m-th iterate of fi applied to HI. Theorem 14 Let P be an arbitrary annotated program, let HI denote the least fixed point of TP and suppose that we are given ε > 0. Then there exists a finite program P(ε) (a finite subset of ground(P)) such that d(HI, HI(ε)) < ε, where HI(ε) denotes the least fixed point of TP(ε).
Therefore, the family {Fn | n ∈ N} computes HI, where Fn denotes the neural network obtained by applying the algorithm of Theorem 11 to Pn, and Pn denotes P(2−n) for n = 1, 2, 3, . . .. This theorem contains two results corresponding to the two separate statements made in it. The first concerns finite approximation of TP, and is a straightforward generalisation of a theorem established in [11]. The second is an immediate consequence of the first conclusion and Theorem 11. Thus, we have shown that the learning neural networks we have built can approximate the least fixed point of the semantic operator defined for first-order BAPs.
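The ultrametric of Definition 13 is easy to compute once interpretations and the level mapping are fixed; the sketch below uses an illustrative dictionary encoding of interpretations, which is our choice and not the paper's.

```python
# Sketch of the ultrametric d(I1, I2) = 2**(-N), where N is the least level
# at which the two interpretations differ, and d = 0 if they agree everywhere.

def ultrametric(i1, i2, level):
    """i1, i2: dicts atom -> truth value; level: dict atom -> natural number."""
    differ = [level[a] for a in i1 if i1[a] != i2[a]]
    return 0.0 if not differ else 2.0 ** (-min(differ))

level = {"p": 1, "q": 2, "r": 3}
i1 = {"p": 1, "q": 0, "r": 1}
i2 = {"p": 1, "q": 1, "r": 0}
d = ultrametric(i1, i2, level)   # the interpretations first differ at level 2
```

Here the two interpretations agree at level 1 and first differ at level 2, so their distance is 2^(−2) = 0.25.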
3
SLD-Resolution Performed in Connectionist Neural Networks
This section introduces an alternative extension of connectionist neural networks: we adapt ideas of connectionism to SLD-resolution. The resulting neural networks have a finite architecture, have learning abilities and can perform parallel computations for certain kinds of program goals. This brings connectionist neural networks closer to the artificial neural networks implemented in neurocomputing, see [2]. Furthermore, the fact that classical first-order derivations require the use of learning mechanisms when implemented in neural networks is very interesting in its own right, and suggests that first-order deductive theories are in fact capable of acquiring some new knowledge, at least to the extent that this process is understood in neurocomputing. In this section we use the notions of first-order language and logic programs defined in Section 1.1.1.
3.1
SLD-Resolution in Logic Programming
We briefly survey the notions of unification and SLD-resolution as they were introduced in [23] and [24]; see also [8]. Let S be a finite set of atoms. A substitution θ is called a unifier for S if Sθ is a singleton. A unifier θ for S is called a most general unifier (mgu) for S if, for each unifier σ of S, there exists a substitution γ such that σ = θγ. To find the disagreement set DS of S, locate the leftmost symbol position at which not all atoms in S have the same symbol, and extract from each atom in S the term beginning at that symbol position. The set of all such terms is the disagreement set.
Unification algorithm:
(1) Put k = 0 and σ0 = ε.
(2) If Sσk is a singleton, then stop; σk is an mgu of S. Otherwise, find the disagreement set Dk of Sσk.
(3) If there exist a variable v and a term t in Dk such that v does not occur in t, then put σk+1 = σk{v/t}, increment k and go to (2). Otherwise, stop; S is not unifiable.
The Unification Theorem establishes that, for any finite S, if S is unifiable, then the unification algorithm terminates and gives an mgu for S. If S is not unifiable, then the unification algorithm terminates and reports this fact. Definition 15 Let a goal G be ← A1, . . . , Am, . . . , Ak and a clause C be A ←
B1, . . . , Bq. Then G′ is derived from G and C using mgu θ if the following conditions hold: • Am is an atom, called the selected atom, in G. • θ is an mgu of Am and A. • G′ is the goal ← (A1, . . . , Am−1, B1, . . . , Bq, Am+1, . . . , Ak)θ.
An SLD-derivation of P ∪ {G} consists of a sequence of goals G = G0, G1, . . ., a sequence C1, C2, . . . of variants of program clauses of P and a sequence θ1, θ2, . . . of mgus such that each Gi+1 is derived from Gi and Ci+1 using θi+1. An SLD-refutation of P ∪ {G} is a finite SLD-derivation of P ∪ {G} which has the empty clause □ as the last goal of the derivation. If Gn = □, we say that the refutation has length n. The success set of P is the set of all A ∈ BP such that P ∪ {← A} has an SLD-refutation. If θ1, . . . , θn is the sequence of mgus used in an SLD-refutation of P ∪ {G}, then a computed answer θ for P ∪ {G} is obtained by restricting the composition θ1 . . . θn to the variables of G. We say that θ is a correct answer for P ∪ {G} if ∀((G)θ) is a logical consequence of P. SLD-resolution is sound and complete, which means that the success set of a definite program is equal to its least Herbrand model. The completeness can alternatively be stated as follows: for every correct answer θ for P ∪ {G}, there exists a computed answer σ for P ∪ {G} and a substitution γ such that θ = σγ. Example 16 Consider the following logic program P1, which determines, for each pair of numbers x1 and x2, whether the x1-th root of x2 is defined. Let Q1 denote the property of being “defined”, let f1(x1, x2) denote the x1-th root of x2, and let Q2, Q3 and Q4 denote, respectively, the properties of being an even number, a nonnegative number and an odd number.
Q1(f1(x1, x2)) ← Q2(x1), Q3(x2)
Q1(f1(x1, x2)) ← Q4(x1)
This program can be enriched with a program which determines whether a given number is odd or even, but we will not address this issue here.
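The unification algorithm can be sketched in code and checked on the atoms of Example 16. Representing atoms as nested tuples and variables as strings beginning with "x" is our encoding choice here, and all names are illustrative.

```python
# Sketch of the unification algorithm: repeatedly take the leftmost
# disagreement pair and bind a variable to a term until the set collapses
# to a singleton, or report failure.

def subst(t, s):
    """Apply a substitution s (dict: variable -> term) to a term t."""
    if isinstance(t, tuple):
        return tuple(subst(u, s) for u in t)
    return s.get(t, t)

def is_var(t):
    return isinstance(t, str) and t.startswith("x")

def disagreement(t1, t2):
    """Leftmost pair of differing subterms of t1 and t2, or None if equal."""
    if t1 == t2:
        return None
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for u1, u2 in zip(t1, t2):
            d = disagreement(u1, u2)
            if d is not None:
                return d
        return None
    return (t1, t2)

def occurs(v, t):
    return v == t or (isinstance(t, tuple) and any(occurs(v, u) for u in t))

def mgu(a1, a2):
    """Most general unifier of atoms a1, a2, or None if not unifiable."""
    s = {}
    while True:
        d = disagreement(subst(a1, s), subst(a2, s))
        if d is None:
            return s                  # the instantiated set is a singleton
        u, v = d
        if is_var(u) and not occurs(u, v):
            var, term = u, v
        elif is_var(v) and not occurs(v, u):
            var, term = v, u
        else:
            return None               # symbol clash or occur check failure
        s = {k: subst(w, {var: term}) for k, w in s.items()}
        s[var] = term

# Example 16: unifying Q1(f1(a1, a2)) with Q1(f1(x1, x2)).
theta = mgu(("Q1", ("f1", "a1", "a2")), ("Q1", ("f1", "x1", "x2")))
```

The computed unifier is {x1/a1, x2/a2}, exactly the composition θ1θ2 found in the refutation walk-through that follows.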
To keep our example simple, we choose G0 = ← Q1(f1(a1, a2)), where a1 = 2 and a2 = 3, and add Q2(a1) ← and Q3(a2) ← to the database. Now the process of SLD-refutation proceeds as follows:
(1) G0 = ← Q1(f1(a1, a2)) is unifiable with Q1(f1(x1, x2)), and the unification algorithm can be applied as follows. Form S = {Q1(f1(a1, a2)), Q1(f1(x1, x2))}. Form DS = {x1, a1}. Put θ1 = {x1/a1}. Now Sθ1 = {Q1(f1(a1, a2)), Q1(f1(a1, x2))}. Find DSθ1 = {x2, a2} and put θ2 = {x2/a2}. Now Sθ1θ2 is a singleton.
(2) Form the next goal G1 = ← (Q2(x1), Q3(x2))θ1θ2 = ← Q2(a1), Q3(a2). Q2(a1) can be unified with the clause Q2(a1) ←, and no substitutions are needed.
(3) Form the goal G2 = ← Q3(a2); it is unified with the clause Q3(a2) ←.
(4) Form the goal G3 = □. There is a refutation of P1 ∪ {G0}, the answer is θ1θ2, and, because the goal G0 is ground, the correct answer is empty.
3.2
Useful Learning Techniques
The following three subsections define the types of learning which are widely recognised and applied in neurocomputing, and which will be used later on for simulations of SLD-resolution in neural networks.
3.2.1
Error-Correction Learning in Artificial Neural Networks
We will use the algorithm of error-correction learning to simulate the process of unification described in the previous section. Error-correction learning is one of the algorithms within the paradigm of supervised learning. Supervised learning is the most popular type of learning implemented in artificial neural networks; we give a brief sketch of the error-correction algorithm in this subsection, and refer to, for example, [4] for further details. Let dk(t) denote some desired response for unit k at time t. Let the corresponding value of the actual response be denoted by vk(t). The response vk(t) is produced by a stimulus (vector) vj(t) applied to the input of the network in which the unit k is embedded. The input vector vj(t) and the desired response dk(t) for unit k constitute a particular example presented to the network at time t. It is assumed that this example and all other examples presented to the network are generated by an environment. We define an error signal as the difference between the desired response dk(t) and the actual response vk(t): ek(t) = dk(t) − vk(t). The error-correction learning rule is the adjustment ∆wkj(t) made to the weight wkj at time t, and is given by ∆wkj(t) = ηek(t)vj(t), where η is a positive constant that determines the rate of learning.
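One complete step of the rule, including the subsequent weight update wkj(t + 1) = wkj(t) + ∆wkj(t), can be sketched on a toy single-weight unit. A linear unit is chosen here only to make convergence visible; it is our simplification, not the construction used later in the paper.

```python
# Toy sketch of one error-correction step for a hypothetical linear unit
# with a single weight.

def error_correction_step(w_kj, v_j, d_k, eta=0.5):
    v_k = w_kj * v_j           # actual response of the unit
    e_k = d_k - v_k            # error signal e_k(t) = d_k(t) - v_k(t)
    return w_kj + eta * e_k * v_j   # w_kj(t+1) = w_kj(t) + eta*e_k(t)*v_j(t)

w = 0.0
for _ in range(20):
    w = error_correction_step(w, v_j=1.0, d_k=2.0)
# w approaches the weight that produces the desired response 2.0
```

Each step halves the remaining error, so after twenty iterations the weight is within 10^(-5) of the target.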
Finally, the formula wkj(t + 1) = wkj(t) + ∆wkj(t) is used to compute the updated value wkj(t + 1) of the weight wkj. We use the formulae defining vk and pk as in Section 1.1.2.
Example 17 The neural network from Example 4 can be transformed into an error-correction learning neural network as follows. We introduce the desired response value dk into the unit k, and the error signal ek computed using dk is sent to the connection between j and k to adjust wkj.
[Figure: unit j, with threshold Θj and input signals v′, v″, v‴, is connected to unit k, with threshold Θk and desired response dk; the error signal ek is sent back along the connection, whose weight is updated from wkj to wkj + ∆wkj, and k outputs ek and vk.]
3.2.2
Filter Learning and Grossberg’s Law
Filter learning, similar to Hebbian learning, is a form of unsupervised learning; see [2] for further details. Filter learning, and in particular Grossberg's law, will be used in simulations of SLD-resolution at stages when a network, acting in accordance with conventional SLD-resolution, must choose, for each atom in the goal, the particular clause in the program with which it will be unified at the next step. Consider the situation when a unit receives multiple input signals v1, v2, . . . , vn, with vn a distinguished signal. In Grossberg's original neurobiological model [25], the vi, i ≠ n, were thought of as “conditioned stimuli” and the signal vn was an “unconditioned stimulus”. Grossberg assumed that vi, i ≠ n, was 0 most of the time and took a large positive value when it became active. Choose some unit c with incoming signals v1, v2, . . . , vn. Grossberg's law is expressed by the equation
wci^new = wci^old + a[vi vn − wci^old]U(vi), (i ∈ {1, . . . , n − 1}),
where 0 ≤ a ≤ 1 and where U(vi) = 1 if vi > 0 and U(vi) = 0 otherwise.
Example 18 Consider the unit j from Example 4. It receives multiple input values, and this means that we can apply Grossberg's learning law to the weights connecting v′, v″ and v‴ with j. In the next section we will also use the inverse form of Grossberg's law and apply the equation
wic^new = wic^old + a[vi vn − wic^old]U(vi), (i ∈ {1, . . . , n − 1})
to enable (unsupervised) change of the weights of connections going from some unit c which sends outcoming signals v1, v2, . . . , vn to units 1, . . . , n respectively. This will enable the outcoming signals of one unit to compete with each other.
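Both forms of the law drive the weight of an active line towards the product vi·vn, while inactive lines are left untouched by the gate U. A toy sketch with illustrative names:

```python
# Sketch of Grossberg's law:
# w_new = w_old + a*(v_i*v_n - w_old)*U(v_i), with 0 <= a <= 1.

def grossberg(w_old, v_i, v_n, a=0.5):
    U = 1.0 if v_i > 0 else 0.0      # U(v_i) gates the update
    return w_old + a * (v_i * v_n - w_old) * U

w = 0.0
for _ in range(10):
    w = grossberg(w, v_i=1.0, v_n=1.0)        # active stimulus: w tends to 1
w_inactive = grossberg(0.3, v_i=0.0, v_n=1.0) # inactive stimulus: unchanged
```

The active weight converges geometrically towards vi·vn = 1, while the weight of the inactive line stays at its old value.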
3.2.3
Kohonen’s Layer and Competitive Learning
The definition of a Kohonen layer was introduced in conjunction with the famous Kohonen learning law [26], and both notions are usually thought of as fine examples of competitive learning. We will not use the latter notion in this paper and concentrate on the notion of a Kohonen layer, which is defined as follows. The layer consists of N units, each receiving n input signals v1, . . . , vn from another layer of units. The input signal vj to Kohonen unit i has a weight wij assigned to it. We denote by wi the vector of weights (wi1, . . . , win), and we use v to denote the vector of input signals (v1, . . . , vn). Each Kohonen unit calculates its input intensity Ii in accordance with the formula Ii = D(wi, v), where D(wi, v) is a distance measurement function. The common choice for D(wi, v) is the Euclidean distance D(wi, v) = |wi − v|. Once each Kohonen unit has calculated its input intensity Ii, a competition takes place to see which unit has the smallest input intensity. Once the winning Kohonen unit is determined, its output vi is set to 1. All the other Kohonen unit output signals are set to 0.
Example 19 The neural network below consists of three units organised in one layer; the competition according to the algorithm described above takes place between these three units, and only one of them will eventually emit a signal.
[Figure: three mutually connected Kohonen units w1, w2, w3 receiving input signals v1, v2 and v3.]
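A single round of this winner-take-all competition can be sketched as follows (illustrative names; the two-dimensional weight vectors are chosen only for the example).

```python
# Sketch of one Kohonen-layer competition: each unit computes the Euclidean
# distance I_i = D(w_i, v) between its weight vector and the input; the unit
# with the smallest intensity wins and alone outputs 1.
import math

def kohonen_outputs(weights, v):
    intensities = [math.dist(w_i, v) for w_i in weights]
    winner = intensities.index(min(intensities))
    return [1 if i == winner else 0 for i in range(len(weights))]

weights = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
out = kohonen_outputs(weights, (0.9, 0.1))   # the second unit is nearest
```

Ties are broken here in favour of the first minimal unit, a detail the definition leaves open.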
3.3
SLD-resolution in neural networks
In order to perform SLD-resolution in neural networks, we will allow not only binary threshold units in the connectionist neural networks, but also units which may receive and send Gödel numbers as signals.
We will use the fact that every first-order language yields a Gödel enumeration. There are several ways of performing the enumeration; we fix one as follows. Each symbol of the first-order language receives a Gödel number as follows:
• variables x1, x2, x3, . . . receive numbers (01), (011), (0111), . . .;
• constants a1, a2, a3, . . . receive numbers (21), (211), (2111), . . .;
• function symbols f1, f2, f3, . . . receive numbers (31), (311), (3111), . . .;
• predicate symbols Q1, Q2, Q3, . . . receive numbers (41), (411), (4111), . . .;
• symbols (, ) and , receive numbers 5, 6 and 7 respectively.
It is possible to enumerate connectives and quantifiers, but we will not need them here and so omit further enumeration. Example 20 The following is the enumeration of the atoms from Example 16; the rightmost column contains the short labels we use for these numbers in further examples:

Atom                 Gödel Number        Label
Q1(f1(x1, x2))       41531501701166      g1
Q2(x1)               4115016             g2
Q3(x2)               411150116           g3
Q3(a2)               411152116           g4
Q2(a1)               4115216             g5
Q1(f1(a1, a2))       41531521721166      g6
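The numbers in Example 20 can be checked mechanically. The dictionary below lists the codes of exactly the symbols used in the example, following the enumeration rules above; the function name is illustrative.

```python
# Mechanical check of the enumeration in Example 20: an atom's Goedel number
# is the concatenation of the codes of its symbols, read left to right.
CODE = {"x1": "01", "x2": "011", "a1": "21", "a2": "211",
        "f1": "31", "Q1": "41", "Q2": "411", "Q3": "4111",
        "(": "5", ")": "6", ",": "7"}

def goedel(tokens):
    """Concatenate the codes of a token sequence into a Goedel number."""
    return "".join(CODE[t] for t in tokens)

g1 = goedel(["Q1", "(", "f1", "(", "x1", ",", "x2", ")", ")"])  # Q1(f1(x1,x2))
g6 = goedel(["Q1", "(", "f1", "(", "a1", ",", "a2", ")", ")"])  # Q1(f1(a1,a2))
```

Running this reproduces the table entries for g1 and g6; the remaining rows can be checked the same way.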
The disagreement set can be defined as follows. Let g1, g2 be the Gödel numbers of two arbitrary atoms A1 and A2 respectively. Define the set g1 g2 as follows. Locate the leftmost symbols jg1 and jg2 in g1 and g2 which are not equal. If jgi, i ∈ {1, 2}, is 0, put 0 and all successor symbols 1, . . . , 1 into g1 g2. If jgi is 2, put 2 and all successor symbols 1, . . . , 1 into g1 g2. If jgi is 3, then extract the first two symbols after jgi and go on extracting successor symbols until the number of occurrences of the symbol 6 becomes equal to the number of occurrences of the symbol 5, and put the number starting with jgi and ending with the last such 6 into g1 g2. It is a straightforward observation that g1 g2 is equivalent to the disagreement set DS, for S = {A1, A2}, as defined in Section 3.1. We will also need the operation ⊕ of concatenation of Gödel numbers, defined by g1 ⊕ g2 = g1 8g2. Let g1 and g2 denote the Gödel numbers of a variable xi and a term t respectively.
We use the number g1 9g2 to describe the substitution σ = {xi/t}, and we call g1 9g2 the Gödel number of the substitution σ. If the substitution is obtained for gm gn, we write s(gm gn). If g1 is the Gödel number of some atom A1, and s is a concatenation of Gödel numbers g1′ 9g2′ 8g1′′ 9g2′′ 8 . . . 8g1′′′ 9g2′′′ of some substitutions σ′, σ′′, . . . , σ′′′, then g1 s is defined as follows: whenever g1 contains a substring (g1)∗ such that (g1)∗ is equivalent to some substring si of s, where either si contains the first symbols of s up to the first symbol 9, or si is contained between 8 and 9 in s but does not contain 8 or 9, substitute this substring (g1)∗ by the substring si′ of symbols which succeed si 9 up to the first 8. It is easy to see that g1 s describes (A1)σ1σ2 . . . σn, as described in Section 3.1. Implemented in neural networks, Gödel numbers can be used as positive or negative signals, and we put g1 s to be 0 if s = −g1. The unification algorithm can be restated in terms of Gödel numbers as follows. Let g1 and g2 be the Gödel numbers of two arbitrary atoms A1 and A2.
(1) Put k = 0 and the Gödel number s0 of the substitution σ0 equal to 0.
(2) If g1 sk = g2 sk, then stop; sk is an mgu of g1 and g2. Otherwise, find the disagreement set (g1 sk) (g2 sk) of g1 sk and g2 sk.
(3) If there exist a number g′ starting with 0 and a number g′′ in (g1 sk) (g2 sk) such that g′ does not occur as a sequence of symbols in g′′, then put sk+1 = sk ⊕ g′ 9g′′, increment k and go to (2). Otherwise, stop; g1 and g2 are not unifiable.
Now we are ready to state and prove the main theorem of this section. Theorem 21 Let P be a definite logic program and G a definite goal. Then there exists a 3-layer backpropagation neural network which computes the Gödel number s of a substitution θ if and only if SLD-refutation derives θ as an answer for P ∪ {G}. We will call neural networks of this architecture SLD neural networks.
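Before turning to the proof, the symbolic part of this restated algorithm can be checked on the numbers of Example 20. The following partial sketch implements only the variable (leading 0) and constant (leading 2) cases of the disagreement-set rule, omitting the function-symbol case; names are illustrative.

```python
# Partial sketch of the disagreement-set computation on Goedel numbers:
# find the leftmost differing symbol and extract each symbol together with
# its trailing run of 1s (variable and constant cases only).

def run_of_ones(g, i):
    """The symbol g[i] together with its trailing run of 1s."""
    j = i + 1
    while j < len(g) and g[j] == "1":
        j += 1
    return g[i:j]

def disagreement(g1, g2):
    for i, (c1, c2) in enumerate(zip(g1, g2)):
        if c1 != c2:                 # leftmost position where they differ
            return {run_of_ones(g, i) for g, c in ((g1, c1), (g2, c2))
                    if c in "02"}
    return set()

# g1 and g6 from Example 20: Q1(f1(x1, x2)) versus Q1(f1(a1, a2)).
d = disagreement("41531501701166", "41531521721166")
```

The result is {"01", "21"}, i.e. {x1, a1}, agreeing with the disagreement set computed symbolically in Example 16.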
PROOF. The proof consists of two major parts: a description of the architecture of SLD neural networks, and an inductive proof that these neural networks simulate the work of SLD-refutation. Let P be a logic program and let C1, . . . , Cm be the definite clauses contained in P. The SLD neural network consists of three layers: a Kohonen layer k of input units k1, . . . , km, a layer h of output units h1, . . . , hm, and a layer o of units o1, . . . , on, where m is the number of clauses in the logic program P, and n is the number of all atoms appearing in the bodies of clauses of P. Similarly to the
connectionist neural networks of [3], each input unit ki represents the head of some clause Ci in P, and is connected to precisely one unit hi, which is connected, in its turn, to units ok, . . . , os representing the atoms contained in the body of Ci. This is the main feature that SLD neural networks share with the connectionist neural networks of [3]. Note that in the neural networks of [3], o was an output layer and h was a hidden layer, whereas in our setting h will be an output layer, and we require the reverse flow of signals compared with [3]. The thresholds of all the units are set to 0. The input units k1, . . . , km will be involved in the process of supervised learning, and this is why each of k1, . . . , km must be characterised by the value of a desired response dki, i ∈ {1, . . . , m}; each dki is the Gödel number of the atom Ai which is the head of the clause Ci. Initially all weights between layer k and layer h are set to 0, but an error-correction learning function ϕ2 is introduced in each connection between ki and hi. The weight from each hi to some oj is defined to be the Gödel number of the atom represented by oj. Suppose the goal G contains atoms B1, . . . , Bn, and let g1, . . . , gn be the Gödel numbers of B1, . . . , Bn. Then, for each gl, do the following: at time t send a signal vl = 1 to each unit ki. The Predicate threshold function will be assumed throughout the proof, and is stated as follows. Set the weight of the connection wki,l(t) to gl (l ∈ {1, . . . , n}) if gl has the string of 1s after the leading 4 of the same length as the string of 1s succeeding 4 in dki (there may be several such signals from one gl, and we denote them by vl1, . . . , vlm).
Set the weight wki l(t) of each connection between l and ki equal to 0 if either of the following two conditions holds: (1) the first symbol of wki l is not 4; (2) the first symbol of wki l is 4, but the succeeding string of 1s has a different length compared with the string of 1s succeeding the symbol 4 in dki. It is easy to see that the Predicate threshold function serves as an analogue of the threshold value Θ for each input unit.
Step 1. Suppose several input signals vl1(t), . . . , vlm(t) were sent from one source to unit ki. At time t, only one of vl1(t), . . . , vlm(t) can be activated, and we apply the inverse Grossberg's law (see Section 3.2.2) to filter the signals vl1(t), . . . , vlm(t) as follows. Fix the unconditioned signal vl1(t) and compute, for each j ∈ {2, . . . , m}, wki lj^new(t) = wki lj^old(t) + [vl1(t)vlj(t) − wki lj^old(t)]U(vlj). We will also refer to this function as ψ1(wki lj(t)). This filter will set all the weights wki lj(t), j ∈ {2, . . . , m}, to 1, and the Predicate threshold will ensure that those weights are inactive. The use of the inverse Grossberg's law here reflects the logic programming
convention that each goal atom unifies with only one clause at a time. Yet several goal atoms may be unifiable with one and the same clause, and we use Grossberg's law (see Section 3.2.2) to filter signals of this type as follows. If an input unit ki receives several signals vj(t), . . . , vr(t) from different sources, then fix an unconditioned signal vj(t) and apply, for all m ∈ {(j + 1), . . . , r}, the equation wki m^new(t) = wki m^old(t) + [vm(t)vj(t) − wki m^old(t)]U(vm) at time t; we will refer to this function as ψ2(wki m(t)). The function ψ2 has the same effect as ψ1: all the signals except vj(t) will have to pass through connections with weight 1, and the Predicate threshold will make them inactive at time t. The functions ψ1 and ψ2 guarantee that each input unit processes only one signal at a time. At this stage we could start further computations independently at each input unit, but the algorithm of SLD-refutation treats each non-ground atom in a goal as dependent on the others via variable substitutions; that is, if one goal atom unifies with some clause, the other goal atoms are subject to the same substitutions. This is why we must avoid independent, parallel computations in the input layer, and we apply the principles of competitive learning as they are realised in Kohonen's layer, see Section 3.2.3. At time t + 1, compute Iki(t + 1) = D(wki j, vj), for each ki, as described in Section 3.2.3. The unit with the least Iki(t + 1) will proceed with the computations of pki(t + 1) and vki(t + 1); all the other units kj ≠ ki automatically receive the value vkj(t + 1) = 0. Note that if none of the wki j(t + 1) contains the symbol 0 (all goal atoms are ground), we do not have to apply Kohonen's competition and can proceed with parallel computations for each input unit.
Now, given an input signal vj(t + 1), the potential pki(t + 1) will be computed using the standard formula pki(t + 1) = vj(t + 1)wki j − Θk, where, as defined before, vj(t + 1) = 1, wki j = gj and Θk = 0. The output signal from ki is computed as follows: vki(t + 1) = pki(t + 1) if pki(t + 1) > 0, and vki(t + 1) = 0 otherwise. At this stage the input unit ki is ready to propagate the signal vki(t + 1) further. However, the signal vki(t + 1) may be different from the desired response dki(t + 1), and the network initialises supervised learning in order to bring the signal vki(t + 1) into correspondence with the desired response. We compute the error signal eki(t + 1) following the definition of error-correction learning from Section 3.2.1, but we adapt the computation of eki(t + 1) to the operations defined on Gödel numbers. Namely, eki(t + 1) = s(dki(t + 1) vki(t + 1)). If (dki(t + 1) vki(t + 1)) = ∅, then put eki(t + 1) = 0; if (dki(t + 1) vki(t + 1)) ≠ ∅ but s(dki(t + 1) vki(t + 1)) is empty, set eki(t + 1) = −wki j(t + 1). Now compute ∆wki j(t + 1) = eki(t + 1)vj(t + 1). In our case each vj(t + 1) was set to 1, and so ∆wki j(t + 1) = eki(t + 1). Send ∆wki j(t + 1) back and compute wki j(t + 2) = ϕ1(wki j(t + 1)) = wki j(t + 1) ∆wki j(t + 1); update dki(t + 2) =
dki(t + 1) eki(t + 1); send ∆wki j(t + 1) forward to compute whi ki(t + 2) = ϕ2(whi ki(t + 1)) = whi ki(t + 1) ⊕ ∆wki j(t + 1) if ∆wki j(t + 1) ≥ 0, and put whi ki(t + 2) = 0 otherwise. Increment t and repeat all the computations for t + 2. Note that the iterations of error-correction learning at unit ki work precisely as the algorithm of unification applied to the Gödel numbers dki and wki j, and the sequence eki(t), eki(t + 1), . . . , eki(t + ∆t) is in fact a sequence of the relevant substitutions. The operation  applies each of these substitutions to wki j and dki; the operation ⊕ accumulates these substitutions. If eki(t + ∆t) = 0 (∆t ≥ 1), set wki j(t + ∆t + 2) = 0, where j is the impulse previously trained via the error-correction algorithm; change the input weights leading from all other sources r, r ≠ j, using wkn r(t + ∆t + 2) = wkn r(t) wki hi(t + ∆t). Whenever at time t + ∆t (∆t ≥ 1), eki(t + ∆t) ≤ 0, set the weight whi ki(t + ∆t + 2) = 0. Furthermore, if eki(t + ∆t) = 0, initialise at time t + ∆t + 2 a new activation of Grossberg's function ψ2 (for some fixed vm ≠ vj); if eki(t + ∆t) < 0, initialise at time t + ∆t + 2 a new activation of the inverse Grossberg's function ψ1 (for some vli ≠ vl1). In both cases initialise the Kohonen layer competition at time t + ∆t + 3.
Step 2. Whenever eki(t + ∆t), (∆t ≥ 1), is 0, compute phi(t + ∆t) = vki(t + ∆t) whi ki(t + ∆t) − Θhi, where Θhi = 0 as before. Note that, according to the definition of ⊕ and because of the learning function ϕ2 defined as whi ki(t + 1) = whi ki(t) ⊕ eki(t), whi ki(t + ∆t) has the form (0eki(t + 1)8 . . . 8eki(t + ∆t)). To compute vhi(t + ∆t), put vhi(t + ∆t) = whi ki(t + ∆t) if phi(t + ∆t) > 0, and vhi(t + ∆t) = 0 otherwise. The signal vhi(t + ∆t) is sent both as the input signal to the layer o and as an output signal of the network which can be read by an external recipient.
Step 2 describes how the network accumulates the relevant substitutions and sends the string of these substitutions both as an output of the neural network and as an input to the next layer.
Step 3. As already defined, hi is connected to some units ol, . . . , or in the layer o with weights wol hi = gol, . . . , wor hi = gor, and vhi is sent to each of ol, . . . , or at time t + ∆t + 1. The network will now compute, for each ol, pol(t + ∆t + 1) = wol hi vhi − Θol, with Θol = 0. Put vol(t + ∆t + 1) = 1 if pol(t + ∆t + 1) > 0, and vol(t + ∆t + 1) = 0 otherwise. At Step 3 the network applies the obtained substitutions to the atoms in the body of the clause whose head has already been unified.
Step 4. At time t + ∆t + 2, vol(t + ∆t + 1) is sent to the layer k. Note that all weights wkj ol(t + ∆t + 2) were defined to be 0, and we introduce the learning
function ϑ: ∆wkj ol (t + ∆t + 1) = pol (t + ∆t + 1)vol (t + ∆t + 1), which can be seen as a kind of Hebbian function as defined in Section 2.2. At time t + ∆t + 2 the network computes wkj ol (t + ∆t + 2) = wkj ol (t + ∆t + 1) + ∆wkj ol (t + ∆t + 1). At Step 4 the new goals, which are the body atoms (with the substitutions applied), are formed and sent to the input layer. Once the signals vol (t + ∆t + 2) are sent as input signals to the input layer k, the Grossberg functions will be activated at time (t + ∆t + 2), Kohonen competition will take place at time (t + ∆t + 3) as described in Step 1, and thus a new iteration starts.

Computing and reading the answer. The signals vhi are read from the hidden layer h and, as can be seen, are Gödel numbers of the relevant substitutions. We say that an SLD neural network computed an answer for P ∪ {G} if and only if, for each external source i and internal source os of input signals vi1 (t), vi2 (t), . . . , vin (t) (respectively vos1 (t), vos2 (t), . . . , vosn (t)), the following holds: for at least one input signal vil (t) (or vosl (t)) sent from the source i (respectively os ), there exists vhj (t + ∆t) such that vhj (t + ∆t) is a string of length l ≥ 2 whose first and last symbol is 0. If, for all vil (t) (respectively vosl (t)), vhj (t + ∆t) = 0, we say that the computation failed.

Backtracking is one of the major techniques in SLD-resolution. We formulate it in SLD neural networks as follows. Whenever vhj (t + ∆t) = 0, do the following. (1) Find the corresponding unit kj and wkj ol , and apply the inverse Grossberg function ψ1 to some vos such that vos has not been an unconditioned signal before. (2) If there is no such vos , find the unit hf connected to os and go to item 1. The rest of the proof proceeds by routine induction; see Appendix B.
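The backtracking regime just described is the neural analogue of the standard depth-first search through SLD-derivations. A minimal symbolic sketch follows (illustrative Python; the toy program is our own, and it is propositional for brevity, so matching a selected goal atom against a clause head degenerates to equality and no substitutions are tracked):

```python
# Depth-first SLD search with backtracking (propositional sketch).
# A program is a list of clauses (head, [body atoms]); a goal is a
# list of atoms to be refuted.

def sld(program, goals):
    """Yield once per SLD-refutation of the goal list. Exhausting the
    clause loop without a yield is a failed branch; the caller's loop
    then tries its next clause, i.e. the search backtracks, just as
    the network re-activates an alternative signal vos above."""
    if not goals:
        yield True                    # empty goal reached: refutation found
        return
    selected, rest = goals[0], goals[1:]
    for head, body in program:        # each matching clause opens a branch
        if head == selected:
            # replace the selected atom by the clause body (cf. Step 4)
            yield from sld(program, body + rest)

# Hypothetical toy program: p <- q.  q <- r.  q <- s.  r.
P = [('p', ['q']), ('q', ['r']), ('q', ['s']), ('r', [])]
print(any(sld(P, ['p'])))             # → True (via p <- q, q <- r, r)
```

Searching the goal p first tries q <- r, which succeeds; had it failed, the loop would have continued with q <- s, which is exactly the backtracking step formulated above.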
Example 22 Consider the logic program from Example 16. The corresponding SLD neural network can be built as follows. The Gödel numbers g1 , . . . , g6 are taken from Example 20. The input layer k consists of units k1 , k2 , k3 and k4 , each with a desired response value dki ; the hidden layer h consists of units h1 , h2 , h3 and h4 ; the output layer o consists of units o1 , o2 and o3 . Set dk1 = g1 , dk2 = g2 , dk3 = g3 , dk4 = g4 . Then the steps of computation of the answer for the goal G0 =← Q1 (f1 (a1 , a2 )) from Example 16 can be performed by the following neural network:
[Figure: the SLD neural network for Example 22. The external source i1 feeds the input units k1 , . . . , k4 (with desired responses dk1 , . . . , dk4 ) via weights g6 ; error signals ek1 , . . . , ek4 connect the input units to the hidden units h1 , . . . , h4 , which are in turn connected to the output units o1 , o2 , o3 ; the computed outputs are vh1 (4), vh3 (7) and vh4 (7).]
(1) At time t = 1, four signals v1 = v2 = v3 = v4 = 1 are sent from the one external source i1 , and wk1 i1 (1) = wk2 i1 (1) = wk3 i1 (1) = wk4 i1 (1) = g6 . The Predicate threshold function will set wk3 i1 (1) = wk4 i1 (1) = 0. Two signals from one source are still active, so we apply the inverse Grossberg law. From the remaining active signals v1 and v2 we pick v1 as unconditioned, and compute wk2 i1 new (1) = wk2 i1 old (1) + [v1 (1)v2 (1) − wk2 i1 old (1)]U (v2 ) = g6 + [1 − g6 ]1 = 1. Now we apply the Predicate threshold and get wk2 i1 new (1) = 0. At this stage neither of the Grossberg laws can be applied, and Kohonen's competition is not applicable either. Thus we proceed with pk1 (1) = 1g6 = g6 and vk1 = g6 . Now the supervised learning can be initiated: ek1 (1) = ∆wk1 i1 (1) = s(dk1 (1) vk1 (1)) = 01921. We are ready to update the parameters using ∆wk1 i1 (1) = ek1 (1), as follows: wk1 i1 (2) = wk1 i1 (1) ∆wk1 i1 (1) = g6 01921 = g6 ; wh1 k1 (2) = wh1 k1 (1) ⊕ ∆wk1 i1 (1) = 0 ⊕ 01921 = 0801921; dk1 (2) = dk1 (1) ek1 (1) = g1 01921 = 41531521701166; call the latter number g7 . Next we increment time, put t = 2, and compute pk1 (2) = vk1 (2) = g6 ; ∆wk1 i1 (2) = ek1 (2) = s(dk1 (2) vk1 (2)) = 0119211. We update the parameters using ∆wk1 i1 (2) = ek1 (2): wk1 i1 (3) = wk1 i1 (2) ∆wk1 i1 (2) = g6 0119211 = g6 ; wh1 k1 (3) = wh1 k1 (2) ⊕ ∆wk1 i1 (2) = 080192180119211; dk1 (3) = dk1 (2) ek1 (2) = g7 0119211 = 41531521721166 = g6 . Increment t, put t = 3, and compute pk1 (3) = vk1 (3) = g6 ; ∆wk1 i1 (3) = ek1 (3) = 0. Thus wk1 i1 (4) = wk1 i1 (3) = g6 ; wh1 k1 (4) = wh1 k1 (3) ⊕ ek1 (3) = 08019218011921180; dk1 (4) = dk1 (3) = g6 . Set wk1 i1 (6) = 0, wh1 k1 (6) = 0 and apply wh1 k1 (4) to change wk2 i1 (6) = wk2 i1 (1) wh1 k1 (4).

(2) Because ek1 (3) = 0, at time t = 4 we compute the potential and value at the next level h, as follows: ph1 (4) = vk1 (4) wh1 k1 (4) = g6 08019218011921180 = g6 , and vh1 (4) = 08019218011921180.
(3) At time t = 5 we proceed with computations at the next level o as follows: po1 (5) = wo1 h1 (4) vh1 (4) = g2 08019218011921180 = 4115216; call the latter number g9 and compute vo1 (5) = 1. Similarly for the unit o2 : po2 (5) = wo2 h1 (4) vh1 (4) = g3 08019218011921180 = 411152116; call the latter number g10 and compute vo2 (5) = 1. The weights between layers o and k can now be updated as follows: wkj o1 (6) = wkj o1 (5) + (po1 (5)vo1 (5)) = 0 + g9 = g9 and, similarly for o2 , wkj o2 (6) = wkj o2 (5) + (po2 (5)vo2 (5)) = 0 + g10 = g10 , for each j ∈ {1, . . . , 4}. A new iteration starts. Apply the Predicate threshold and set wk1 o1 (6) = wk2 o1 (6) = wk4 o1 (6) = wk1 o2 (6) = wk2 o2 (6) = wk3 o2 (6) = 0. Neither of the Grossberg functions can be applied here; because wk3 o1 (6) = g9 and wk4 o2 (6) = g10 do not contain the symbol 0 (i.e., they denote ground atoms), we do not impose Kohonen's law any more, and continue to propagate the two input signals in parallel.

For the unit k3 :
pk3 (6) = vk3 (6) = g9 ;
ek3 (6) = 01921;
Update parameters:
wk3 o1 (7) = wk3 o1 (6) = g9 ;
wh3 k3 (7) = 0801921;
dk3 (7) = g4 01921 = 4115216 = g9 ;
pk3 (7) = vk3 (7) = g9 ;
ek3 (7) = 0;
wh3 k3 (7) = 080192180;
wh3 k3 (9) = 0;
ph3 (7) = vk3 (7) wh3 k3 (7) = g9 , vh3 (7) = 080192180.

For the unit k4 :
pk4 (6) = vk4 (6) = g10 ;
ek4 (6) = 0119211;
Update parameters:
wk4 o2 (7) = wk4 o2 (6) = g10 ;
wh4 k4 (7) = 080119211;
dk4 (7) = g5 0119211 = 411152116 = g10 ;
pk4 (7) = vk4 (7) = g10 ;
ek4 (7) = 0;
wh4 k4 (7) = 08011921180;
wh4 k4 (9) = 0;
ph4 (7) = vk4 (7) wh4 k4 (7) = g10 , vh4 (7) = 08011921180.
No further signals can be sent and no further learning functions can be applied, yet for every input signal vi we have reached a corresponding vhj which contains 0 as its first and its last symbol. Computations stop. The answer is: vh1 (4) = 08019218011921180, vh3 (7) = 080192180, vh4 (7) = 08011921180. It is easy to see that the output signals correspond to the Gödel numbers of the substitutions obtained as an answer for P1 ∪ G0 in Example 16. Note that if we were to build a connectionist neural network of [3] corresponding to the logic program P1 from Example 16, we would need to build a neural network with infinitely many units in all three layers. And, since such networks cannot be built in the real world, we would finally need to use some approximation theorem, which is, in general, non-constructive.
Several conclusions can be drawn from the algorithm described above:
• SLD neural networks have a finite architecture, but their effectiveness is due to six learning functions, namely the two Grossberg filter learning functions ψ1 and ψ2 , the supervised learning functions ϕ1 and ϕ2 , Kohonen's competitive learning and the Hebbian learning function ϑ, together with the Predicate threshold function. The most important of these are ϕ1 and ϕ2 , which provide the supervised learning that, as we have shown, models the work of the unification algorithm.
• The learning laws implemented in SLD neural networks exhibit a "creative" component of the SLD-resolution algorithm. Indeed, the search for a successful unification and the choice of goal atoms and program clauses at each step of a derivation are not fully determined by the algorithm, but leave us (or the program interpreter) to make a choice, and in this sense allow a certain "creativity" in the decisions. The fact that the process of unification is simulated by means of an error-correction learning algorithm reflects the fact that the unification algorithm is, in essence, a correction of one piece of data relative to another piece of data. This also suggests that unification is not a totally deductive algorithm, but an adaptive process.
• The Kohonen competitive layer serves to guard the network from undesirable parallelism. Essentially, it restricts the full power of parallel computation in neural networks to the power of classical SLD-resolution. This restriction may seem somewhat disconcerting, but we had to introduce it because massive parallelism of computations would destroy the soundness and completeness of SLD-derivations: it would not respect the logic programming convention that a variable appearing in several goal atoms must be uniformly substituted at all steps of a derivation. However, if we work with ground goals, the Kohonen competition is not needed and SLD neural networks can perform massively parallel computations.
We simply need to lift the Kohonen competition whenever the active input weights do not contain the symbol 0.
• The SLD neural networks are not feedforward, unlike the connectionist neural networks of [3], but require backpropagation of signals.
• Atoms and substitutions of the first-order language are represented in SLD neural networks internally, via Gödel numbers carried by weights and other parameters. This distinguishes SLD neural networks from the connectionist neural networks of [3], where symbols appearing in a logic program were not encoded in the corresponding neural network directly, but each unit was just "thought of" as representing some atom. This suggests that SLD neural networks allow easier machine implementation compared with the neural networks of [3].
• The SLD neural networks can realize either the depth-first or the breadth-first search algorithm implemented in SLD-resolution, and this can be fixed by imposing conditions on the choice of the unconditioned stimulus during the use of Grossberg's law in layer k.
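The role of the competitive layer in the list above can be pictured by a standard winner-take-all step. This is a generic sketch of Kohonen-style competition, not the exact rule used in SLD neural networks: only the unit with the strongest potential remains active, which is what serialises the otherwise parallel propagation into a single derivation at a time.

```python
# Generic winner-take-all competition: keep a single active unit.

def winner_take_all(potentials):
    """Given the units' potentials, return activations with exactly one
    winner (the first maximal unit); all other units are suppressed.
    Suppressing the rest is what restricts parallel propagation to one
    SLD-derivation at a time."""
    if not potentials:
        return []
    winner = max(range(len(potentials)), key=lambda i: potentials[i])
    return [1 if i == winner else 0 for i in range(len(potentials))]

print(winner_take_all([0.2, 0.9, 0.4]))   # → [0, 1, 0]
```

Lifting the competition, as for ground goals, would simply mean returning all units whose potential is positive instead of the single winner.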
4 Conclusions and Further Work
We have introduced two algorithms: the first shows how learning neural networks can perform computations of the semantic operator defined for logic programs with uncertainty, and the second enables us to simulate the work of SLD-resolution by learning neural networks. The first approach uses the traditional techniques introduced in [3] and surveyed in Section 1.1.2, and essentially just enriches the neural networks of [3] so as to perform computations of the semantic operator for logic programs with uncertainty. The main virtue of this approach is that it exhibits some nontrivial properties of these logic programs, and it also suggests that different types of non-classical, many-valued and (bi)lattice-based logics and logic programs may have interesting implementations in neurocomputing. The neural networks built in Section 2.2 can be used, with minor modifications, for the logic programs for reasoning with uncertainty described in [14,15,27,16,17,28,20,19]. See also [5] for a discussion of the relations of non-monotonic and inductive reasoning to learning in neural networks. The simplicity of the architecture of these neural networks can be seen as an advantage, especially when they are built upon propositional logic programs. However, for first-order logic programs with individual or annotation function symbols these neural networks are not practical, although the approximation theorem establishes that the approximation of infinite neural networks by finite neural networks is possible. The second approach is more innovative in that it advocates the method of SLD-resolution, rather than fixpoint semantics, as the suitable formalism for neural-symbolic integration. This approach seems to be closer to natural neural network computations because, clearly, natural neural networks are finite in space and time. Moreover, human reasoning is similar to the SLD-refutation algorithm in that it is essentially goal-oriented.
This is why the neural networks simulating the semantic operator and computing all (possibly infinitely many) logical consequences of a given logic program do not seem natural from a philosophical perspective. SLD neural networks have computational benefits too: both the unification algorithm and the breadth-first/depth-first search involved in SLD-refutation are known to be P-complete ([29],[30]), and SLD neural networks which simulate SLD-refutation inherit this characteristic. In [21] and [22] we proposed a sound and complete SLD-resolution for BAPs. Simulations of this resolution by SLD neural networks should employ all the learning techniques of Theorems 11 and 21. The SLD neural networks for BAPs should be primarily based on the SLD neural networks we have built, but with the Hebbian and anti-Hebbian learning functions of Theorem 11 added. This algorithm would also require a more sophisticated Gödel numbering, and this is why we do not address the issue of building SLD neural networks for BAPs in this paper, although it may be an interesting exercise for the future. Returning to the main question we posed in the Introduction, we conclude the following: neural networks were shown to be able to perform deductive reasoning, as demonstrated in Sections 1.1.2, 2 and 3. But, more importantly, the learning neural networks we have built in Sections 2 and 3 are computationally equivalent to the logical counterparts they simulate. Thus we can conclude that both the semantic operator for bilattice-based annotated logic programs and SLD-resolution bear certain properties which can be realized by learning algorithms in neurocomputing. This conclusion sounds as an apology for symbolic (deductive) theories, because it suggests that the symbolic reasoning which we normally call "deductive" does not necessarily imply the lack of learning and spontaneous self-organisation in the sense of neurocomputing.
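To make the comparison between the two approaches concrete, the semantic-operator approach can be sketched for the simplest, classical propositional case. The Python below is illustrative: the operator of Section 2 additionally handles bilattice annotations, which are omitted here, and the toy program is our own.

```python
# Classical immediate-consequence operator T_P for a propositional
# program, iterated to its least fixed point. The annotated,
# bilattice-based operator of the paper refines this with lattice
# operations on annotations.

def tp(program, interpretation):
    """One application of T_P: add every head whose body is satisfied."""
    return interpretation | {head for head, body in program
                             if all(b in interpretation for b in body)}

def least_fixpoint(program):
    """Iterate T_P from the empty interpretation until it stabilises."""
    current = set()
    while True:
        nxt = tp(program, current)
        if nxt == current:
            return current
        current = nxt

# Hypothetical toy program: q.  p <- q.  r <- p, q.
P = [('q', []), ('p', ['q']), ('r', ['p', 'q'])]
print(sorted(least_fixpoint(P)))          # → ['p', 'q', 'r']
```

Note the contrast with the SLD sketch of Section 3: the fixpoint iteration computes every consequence bottom-up, whereas SLD-resolution explores only the derivations relevant to a given goal, which is why the latter stays finite for first-order programs.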
5 Acknowledgements
One section of this paper was presented at the International Conference on Computability in Europe 2006 (CiE'06), and I thank the anonymous referees and Dr. D. Woods for useful suggestions concerning a preliminary version of the paper published in the Local Proceedings of CiE'06. I am particularly grateful to Dr. Anthony Seda for organising the work of the research group Nature-Inspired Models of Computations and for his supervision of my PhD studies. I acknowledge the partial support of the Boole Centre for Research in Informatics in the preparation of this paper. I am grateful to the Association for Symbolic Logic and the Organisers of CiE'06 for offering two generous student grants which covered all the expenses for participation in CiE'06.
References
[1] J. W. Dawson, Gödel and the origins of computer science, in: A. Beckmann, U. Berger, B. Löwe, J. V. Tucker (Eds.), Logical Approaches to Computational Barriers, CiE'06, Vol. 3988 of LNCS, 2006, pp. 133–137.
[2] R. Hecht-Nielsen, Neurocomputing, Addison-Wesley, 1990.
[3] S. Hölldobler, Y. Kalinke, H. P. Störr, Approximating the semantics of logic programs by recurrent neural networks, Applied Intelligence 11 (1999) 45–58.
[4] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, 1994.
[5] A. d'Avila Garcez, K. B. Broda, D. M. Gabbay, Neural-Symbolic Learning Systems: Foundations and Applications, Springer-Verlag, 2002.
[6] A. d'Avila Garcez, G. Zaverucha, L. A. de Carvalho, Logical inference and inductive learning in artificial neural networks, in: C. Hermann, F. Reine, A. Strohmaier (Eds.), Knowledge Representation in Neural Networks, Logos Verlag, Berlin, 1997, pp. 33–46.
[7] A. d'Avila Garcez, G. Zaverucha, The connectionist inductive learning and logic programming system, Applied Intelligence, Special Issue on Neural Networks and Structured Knowledge 11 (1) (1999) 59–77.
[8] J. W. Lloyd, Foundations of Logic Programming, 2nd Edition, Springer-Verlag, 1987.
[9] M. van Emden, R. Kowalski, The semantics of predicate logic as a programming language, Journal of the ACM 23 (1976) 733–742.
[10] P. Hitzler, S. Hölldobler, A. K. Seda, Logic programs and connectionist networks, Journal of Applied Logic 2 (3) (2004) 245–272.
[11] A. K. Seda, On the integration of connectionist and logic-based systems, in: T. Hurley, M. Mac an Airchinnigh, M. Schellekens, A. K. Seda, G. Strong (Eds.), Proceedings of MFCSIT2004, Trinity College Dublin, July 2004, Electronic Notes in Theoretical Computer Science, Elsevier, 2005.
[12] S. Bader, P. Hitzler, A. Witzel, Integrating first-order logic programs and connectionist systems — a constructive approach, in: A. S. d'Avila Garcez, J. Elman, P. Hitzler (Eds.), Proceedings of the IJCAI-05 Workshop on Neural-Symbolic Learning and Reasoning, NeSy'05, Edinburgh, UK, 2005.
[13] M. L. Ginsberg, Multivalued logics: a uniform approach to reasoning in artificial intelligence, Computational Intelligence 4 (1988) 265–316.
[14] M. C. Fitting, Bilattices and the semantics of logic programming, Journal of Logic Programming 11 (1991) 91–116.
[15] M. C. Fitting, Bilattices in logic programming, in: G. Epstein (Ed.), The Twentieth International Symposium on Multiple-Valued Logic, IEEE, 1990, pp. 238–246.
[16] M. Kifer, E. L. Lozinskii, RI: A logic for reasoning with inconsistency, in: Proceedings of the 4th IEEE Symposium on Logic in Computer Science (LICS), IEEE Computer Press, Asilomar, 1989, pp. 253–262.
[17] M. Kifer, V. S. Subrahmanian, Theory of generalized annotated logic programming and its applications, Journal of Logic Programming 12 (1991) 335–367.
[18] E. Komendantskaya, A. K. Seda, V. Komendantsky, On approximation of the semantic operators determined by bilattice-based logic programs, in: Proceedings of the Seventh International Workshop on First-Order Theorem Proving (FTP'05), Koblenz, Germany, 2005, pp. 112–130.
[19] J. J. Lu, N. V. Murray, E. Rosental, A framework for automated reasoning in multiple-valued logics, Journal of Automated Reasoning 21 (1) (1998) 39–67.
[20] J. J. Lu, N. V. Murray, E. Rosental, Deduction and search strategies for regular multiple-valued logics, Journal of Multiple-Valued Logic and Soft Computing 11 (2005) 375–406.
[21] E. Komendantskaya, A. K. Seda, Logic programs with uncertainty: neural computations and automated reasoning, in: Proceedings of the International Conference Computability in Europe, Swansea, Wales, 2006, pp. 170–182.
[22] E. Komendantskaya, A. K. Seda, Declarative and operational semantics for bilattice-based annotated logic programs, in: Proceedings of the Fourth Irish Conference on the Mathematical Foundations of Computer Science and Information Technology (MFCSIT), UCC, Cork, 2006, pp. 229–233.
[23] J. A. Robinson, A machine-oriented logic based on the resolution principle, Journal of the ACM 12 (1) (1965) 23–41.
[24] R. A. Kowalski, Predicate logic as a programming language, in: Information Processing 74, Stockholm, North Holland, 1974, pp. 569–574.
[25] S. Grossberg, Embedding fields: A theory of learning with physiological implications, J. Math. Psych. 6 (1969) 209–239.
[26] T. Kohonen, Self-Organization and Associative Memory, 2nd Edition, Springer-Verlag, Berlin, 1988.
[27] M. van Emden, Quantitative deduction and fixpoint theory, Journal of Logic Programming 3 (1986) 37–53.
[28] J. J. Lu, Logic programming with signs and annotations, Journal of Logic and Computation 6 (6) (1996) 755–778.
[29] C. Dwork, P. C. Kanellakis, J. C. Mitchell, On the sequential nature of unification, Journal of Logic Programming 1 (1984) 35–50.
[30] R. Greenlaw, H. J. Hoover, W. L. Ruzzo, A compendium of problems complete for P, Tech. Rep. TR 91-05-01, Department of Computer Science and Engineering, University of Washington, 1991.
A Inductive part of the proof of Theorem 11.
Subproof 1. We prove that if some annotated atom A : (µ, ν) is computed by TP , then it is computed by the neural network described in the construction of Theorem 11.
If A : (α, β) ∈ TP ↑ n, then there is a clause A : (α, β) ← B1 : (α1 , β1 ), . . . , Bm : (αm , βm ) in ground(P ) such that one of the following holds:

i. {B1 : (α1′ , β1′ ), . . . , Bm : (αm′ , βm′ )} ⊆ TP ↑ (n − 1), and for each (αi′ , βi′ ) one of the following holds:
(a) (αi′ , βi′ ) ≤k (αi , βi );
(b) (αi′ , βi′ ) ≤k (αj , βj ) ⊗ . . . ⊗ (αl , βl ) (i, j, l ∈ {1, . . . , m}), whenever Bi = Bj = . . . = Bl .
ii. there are annotated strictly ground atoms A : (α1∗ , β1∗ ), . . . , A : (αk∗ , βk∗ ) ∈ TP ↑ (n − 1) such that (α, β) ≤k (α1∗ , β1∗ ) ⊕ . . . ⊕ (αk∗ , βk∗ ).

Consider the cases i.(a) and i.(b). Let c be the unit in the hidden layer associated with this clause according to item 2.1 of the construction in the translation algorithm of Theorem 11. The part of the proof for i.(a) corresponds to the proof of the similar theorem by Hölldobler et al., see [3]. The proof for cases i.(b) and ii. involves the learning functions.

Consider the case i.(b). We know that there is a clause A : (α, β) ← B1 : (α1 , β1 ), . . . , Bm : (αm , βm ) in ground(P ), which means that each unit representing one of B1 : (α1 , β1 ), . . . , Bm : (αm , βm ) is connected to c with weight 1. Since B1 : (α1′ , β1′ ), . . . , Bm : (αm′ , βm′ ) ∈ TP ↑ (n − 1), these elements are in the annotation Herbrand base, which means that there are units representing them in the input layer (see item 1 in the description of the neural network). Thus, all these units are connected to c via ⊗-connections according to item 2.5. The fact that B1 : (α1′ , β1′ ), . . . , Bm : (αm′ , βm′ ) ∈ TP ↑ (n − 1) shows that these units were activated at time t0 . Now, item i.(b) in the description of TP above says that for some (αi′ , βi′ ), (αi′ , βi′ ) ≤k (αj , βj ) ⊗ . . . ⊗ (αl , βl ) (i, j, l ∈ {1, . . . , m}), and Bi = Bj = . . . = Bl . This means that the function φ1 is activated at time t + 1 according to item 5.1 in the description of the neural network architecture, and the weight from the unit representing Bi : (αi′ , βi′ ) to c is set to 1 at time t + 1.
Consequently, item 2.2 in the description of the neural network and the fact that units occurring in the output layer have threshold 0.5 (see item 1) ensure that the unit representing A : (α, β) in the output layer becomes active at time t + 2. Consider the case ii. Since A : (α1∗ , β1∗ ), . . . , A : (αk∗ , βk∗ ) ∈ TP ↑ (n − 1), there are units representing these annotated strictly ground atoms in the input and output layers, and the units in the input layer became active at time t. Moreover, according to item 3 in the description of the neural network architecture, all these units in the input and output layers have ⊕-connections. This is why we can conclude that φ2 is activated at time t + 2, and therefore the unit representing A : (α, β) becomes active. Subproof 2. We prove that, if the neural network described in the proof of Theorem 11 computes some annotated atom, then this atom will be computed by TP .
Suppose that the unit representing the annotated atom A : (α, β) in the output layer becomes active at time t + 2. From the construction of the network, there are two situations in which this can happen, because the signal could come either from a regular hidden unit or from a ⊕-hidden unit.

(1) Consider the situation when the signal comes from a regular hidden unit c which became active at time t + 1. This hidden unit is associated with the clause A : (α, β) ← B1 : (α1 , β1 ), . . . , Bm : (αm , βm ). If m = 0, that is, if the body of the clause is empty, then, according to item 2.4 in the proof of Theorem 11, c has threshold −0.5. Furthermore, according to item 2.3 in the same proof, c does not receive any input, that is, pc = 0 + 0.5, and consequently c will always be active. If m ≥ 1, the hidden unit c could be activated at time t + 1 in one of the following cases:

[1.a] Each unit in the input layer representing an annotated atom in the body of the clause (and having a 1-connection with c) was active at time t0 (see items 2.3 and 2.4 in the proof of Theorem 11). Hence, we have found a strictly ground clause A : (α, β) ← B1 : (α1 , β1 ), . . . , Bm : (αm , βm ) such that for all 1 ≤ i ≤ m we have Bi : (αi , βi ) ∈ TP ↑ (n − 1), and consequently A : (α, β) ∈ TP ↑ n.

[1.b] The hidden unit c can be activated when some of the signals were received when a learning function φ1 was activated at time t + 1 and acted via the ⊗-connection from a unit representing some annotated atom Bi : (αi′ , βi′ ), provided that ((Bi = Bj ), . . . , (Bi = Bk )) for some j, k ∈ {1, . . . , m} and (αi′ , βi′ ) ≥k (αj , βj ) ⊗ . . . ⊗ (αk , βk ). (The signals other than those received via ⊗-connections are thought of as being received as in item 1.a.) Then we can see that we have found a strictly ground clause A : (α, β) ← B1 : (α1 , β1 ), . . . , Bm : (αm , βm ), where for some Bj : (αj , βj ), . . . , Bk : (αk , βk ) with ((Bi = Bj ), . . . , (Bi = Bk )) such that (αi′ , βi′ ) ≥k (αj , βj ) ⊗ . . . ⊗ (αk , βk ), the annotated atom Bi : (αi′ , βi′ ) ∈ TP ↑ (n − 1). Hence, using the assumption that all other annotated atoms except Bi : (αi′ , βi′ ) either can be treated in a similar way to Bi : (αi′ , βi′ ) or according to item 1.(a), and using the definition of TP (see Definition 8, item 1b), we obtain that A : (α, β) ∈ TP ↑ n. (Note that this case covers the case when some Bi : (αi′ , βi′ ) ∈ TP ↑ (n − 1) and (αi′ , βi′ ) ≥k (αi , βi ), for some i ∈ {1, . . . , m}.)

(2) Suppose the unit representing A : (α, β) is activated as soon as the learning function φ2 is activated at time t + 2 via the ⊕-connection. Then, by the construction of the network (see item 3 and the definition of φ2 in the proof of Theorem 11), there are units representing annotated atoms A : (αi∗ , βi∗ ), . . . , A : (αk∗ , βk∗ ) in the input layer such that they are connected to A : (α, β) via ⊕-connections, they became active at time t0 , and (α, β) ≤k (αi∗ , βi∗ ) ⊕ . . . ⊕ (αk∗ , βk∗ ). This means that A : (αi∗ , βi∗ ), . . . , A : (αk∗ , βk∗ ) ∈ TP ↑ (n − 1). Then, according to the definition of TP (see Definition 8, item 2), A : (α, β) ∈ TP ↑ n. This completes the proof.
B Inductive part of the proof of Theorem 21.
Let P be a logic program, G0 a program goal and NN the SLD neural network built as described in Theorem 21. (In the proof we assume that we work with the general case, when goals are non-ground, and therefore we need to apply Kohonen's competition at each step of computation.)

Subproof 1. We prove that, if there is an SLD-refutation for P ∪ G0 with answer θ1 , . . . , θn , then NN will compute the Gödel number of θ1 , . . . , θn . We proceed by induction on the length n of the refutation of P ∪ G0 .

Basis step. Let n = 1. Then G0 =← B and there exists a unit clause A ←, where A and B are first-order atomic formulae, such that there exists θ making Aθ = Bθ. But then, by the construction of NN, there exists an input unit kj with dkj such that dkj is the Gödel number of A, and the unit kj receives an input signal vi (t) = 1 via wkj i (t), such that wkj i (t) is the Gödel number of B. If θ exists, then, by the unification algorithm of Section 3.3 and the definition of the supervised learning at layer k, e(t + ∆t) = 0 will be reached by NN at some time t + ∆t, and vhj (t + ∆t) will be computed, vhj (t + ∆t) being the Gödel number of θ.

Inductive step. Suppose the statement holds for refutations of length n − 1; we prove that it then holds for refutations of length n. Consider some goal Gk−1 =← (B1 , . . . , Bl , . . . , Bs )θ1 , . . . , θk−1 which has a refutation of length n, and the goal Gk =← (B1 , . . . , Bl−1 , C1 , . . . , Cm , Bl+1 , . . . , Bs )θ1 , . . . , θk derived from Gk−1 and some clause A ← C1 , . . . , Cm , such that Bl is the selected goal atom and there exist θ1 , . . . , θk making Aθk = Bl θ1 , . . . , θk . Clearly, Gk has a refutation of length n − 1 and, by the induction hypothesis, NN computes the answer for Gk . By the construction of NN, there exists a unit kj with dkj equal to the Gödel number of A, and there exists a unit hj connected both to kj and to some oi1 , . . . , oim , such that each weight woij hj is the Gödel number of Cj . Because Gk−1 has a refutation, by the construction of NN some input signal vf (t) will be sent to the layer k with weight wkr f (t) equal to the Gödel number of Bl θ1 , . . . , θk . And, because Bl has to be involved in the further refutation and is unifiable with A, by the construction of NN we conclude that, after applications of the Predicate threshold, the two Grossberg functions and Kohonen's competition, vf (t) will reach kj and initialise the error-correction learning using dkj (t) at time t. Moreover, the fact that A and Bl are unifiable suggests that at time t + ∆t NN will reach the state where ekj (t + ∆t) = 0, and vhj (t + ∆t) will be read off the hidden layer h; and, since hj is connected to oi1 , . . . , oim , vhj (t + ∆t) will be sent to oi1 , . . . , oim . Further, each voij (t + ∆t + 2) = 1 will be sent to the input layer k with each wkr oij (t + ∆t + 2) equal to the Gödel
number of Cj θk . But, according to the induction hypothesis, NN will compute the Gödel numbers of the answers for ← B1 , . . . , Bl−1 , C1 , . . . , Cm , Bl+1 , . . . , Bs . Thus we have proved that NN computes the Gödel numbers of the answers for P ∪ Gk−1 .

Subproof 2. We prove that, if an SLD neural network computes vh1 , . . . , vhk representing Gödel numbers s1 , . . . , sn , then s1 , . . . , sn are Gödel numbers of substitutions which constitute an answer for P ∪ G. We prove this statement by induction on the number n of iterations of NN.

Basis step. Suppose n = 1, that is, only one iteration was needed to compute an answer vhj . This means that there was only one external impulse vi (1), which activated, through wkj i (1), some unit kj in the input layer k. Since the answer of the SLD neural network is actually computed, no other signals are received by the input layer k, and from this we conclude that hj is not connected to any unit in the layer o. Therefore, by the construction of NN, we conclude that there was a goal G0 =← B, with B being some atom of the first-order language such that the Gödel number of B is equal to wkj i (1); and there was a unit clause A ← in P , A being some atom of the first-order language, such that dkj is precisely the Gödel number of A; moreover, there exists θ such that Aθ = Bθ, with θ being such that its Gödel number is vhj . But then, clearly, P ∪ {G0 } has an SLD-refutation with the answer θ.

Inductive step. Suppose the statement holds for n − 1 iterations; we prove that it then holds for n iterations. Consider the iteration n at which some vhj (t) was computed (that is, vhj (t) has the form 0ekj (t − 2)8 . . . 8ekj (t − 2 − ∆t)0), such that vh1 , . . . , vhj , . . . , vhk constitute the final answer, with each vhl (l 6= j) computed at one of the n − 1 iterations. By the construction of NN, the unit hj was first excited by the unit kj , which received, at some time t − ∆t − 2, either an input signal vi (t − ∆t − 2) (if it was an external impulse) or vos (t − ∆t − 2) (if it was an internal signal). By the construction of NN, we know that wos kj (t − ∆t − 2) (or wi kj (t − ∆t − 2)) is the Gödel number of some first-order atomic formula Bl , and dkj (t − ∆t − 2) is the Gödel number of some first-order atomic formula A which is the head of some clause A ← C1 , . . . , Cm . Moreover, Bl is the selected goal atom of some G =← B1 , . . . , Bl , . . . , Bn . Since vhj (t) is of the form 0ekj (t − 2)8 . . . 8ekj (t − 2 − ∆t)0, we conclude, using the definition of ∆whj kj and the definition of the unification algorithm given in Section 3.3, that Bl and A are unifiable via some θ such that the Gödel number of θ is precisely vhj (t). Furthermore, vhj (t) is sent to oi1 , . . . , oim , with each woir hj encoding the Gödel number of Cr from A ← C1 , . . . , Cm . The output layer then emits the signals voi1 = . . . = voim = 1, and the weight between each oij and the layer k is set to the Gödel number of Cj θ.
Now n − 1 iterations remain to be completed. And, according to the induction hypothesis, the signals emitted by NN during these n − 1 iterations correspond to the Gödel numbers of the relevant answers for the SLD-refutation of P ∪ ← (B1 , . . . , Bl−1 , C1 , . . . , Cm , Bl+1 , . . . , Bs )θ. Thus NN computes vh1 , . . . , vhj , . . . , vhk , which constitute the Gödel number of an answer for P ∪ G in n iterations. This completes the proof.