Approximation Properties of Positive Boolean Functions

Marco Muselli
Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni,
Consiglio Nazionale delle Ricerche, via De Marini, 6 - 16149 Genova, Italy
[email protected]

Abstract. The universal approximation property is an important characteristic of models employed in the solution of machine learning problems. The possibility of approximating within a desired precision any Borel measurable function guarantees the generality of the considered approach. The properties of the class of positive Boolean functions, realizable by digital circuits containing only and and or gates, are examined by considering a proper coding for ordered and nominal variables, which is able to preserve ordering and distance. In particular, it is shown that positive Boolean functions are universal approximators and can therefore be used in the solution of classification and regression problems.
1 Introduction
An important topic in computational learning theory concerns the characterization of the class Γ of functions (models) to be adopted when searching for the solution of a classification or regression problem. This class must be sufficiently rich to allow the treatment of a wide variety of real-world problems, it must ensure the natural handling of different kinds of input variables, and it must permit the application of an efficient learning technique. For example, if Γ is the class of multilayer perceptrons with one hidden layer and sigmoidal activation functions, well-established theoretical results guarantee that Γ possesses the universal approximation property [1], i.e., under mild conditions any Borel measurable real function can be approximated arbitrarily well by a multilayer perceptron, provided that a sufficient number of neurons is included in the hidden layer. The same property characterizes other widely used connectionist models, such as radial basis function networks [2] and support vector machines [3]. The adoption of methods based on the synthesis of Boolean functions is likewise motivated by the well-known fact that Boolean functions are universal approximators, a basic property underlying the functioning of computing systems. If the use of the complement operator not is not allowed, we can only realize the subset of positive Boolean functions, which does not include several simple binary mappings, such as the parity function. This limitation has prevented their use in the solution of
classification and regression problems, together with the lack of an efficient learning technique for the selection of a positive Boolean function that generalizes the information contained in the training set. This paper aims to refute this view by proving that the class L of positive Boolean functions possesses the universal approximation property. The adoption of L in the solution of classification problems gives rise to a new connectionist model, called Switching Neural Network (SNN) [6], which can be completely described through a set of intelligible rules in the if-then form. SNNs are trained by a specific procedure for positive Boolean function reconstruction, called Shadow Clustering [7].
2 Definitions and Notations
Consider the Boolean lattice ({0, 1}^n, ∨, ∧, 0, 1), where ∨ and ∧ are the logical sum (or) and the logical product (and), respectively. The addition of the complement operation not makes {0, 1}^n a Boolean algebra, widely used in digital circuit design and in many other scientific fields. It is known that any Boolean function f : {0, 1}^n → {0, 1} can be realized through an expression in this algebra. On the contrary, the lack of the complement operation in the Boolean lattice {0, 1}^n allows one to generate only the subset of positive Boolean functions f, for which f(x) ≤ f(y) whenever x ≤ y, where ≤ is the standard partial ordering defined on lattices.

To prove that positive Boolean functions are universal approximators, i.e. that they can approximate arbitrarily well any measurable function g : R^d → R, let us define in a formal way the concept of approximating the elements of a set X through the elements of another set Y. To this aim, suppose a metric ρ exists on X.

Definition 1. A subset Z of a metric space (X, ρ) is ρ-dense in X if and only if for every ε > 0 and for every x ∈ X there is a z ∈ Z such that ρ(z, x) < ε. A set Y approximates arbitrarily well a metric space (X, ρ) if and only if a mapping η : Y → X can be found such that its range η(Y) is ρ-dense in X.

Definition 1 requires the choice of a proper mapping η to establish that a set Y approximates arbitrarily well a metric space (X, ρ). When X is a class of functions g : A_X → B_X and Y contains mappings f : A_Y → B_Y as elements, a possible way of constructing η consists in writing it as the composition η(f) = ψ ∘ f ∘ ϕ of f with two mappings ϕ : A_X → A_Y and ψ : B_Y → B_X. With this choice η(f) gives a function from A_X to B_X for any f ∈ Y. Note that if there is a mapping η from Y onto X, the set Y approximates arbitrarily well the space X, whatever the metric ρ on it.

A classical result in approximation theory ensures that the set of discrete functions h : I_m^d → I_m, where I_m denotes the set {1, 2, . . . , m} of the first m positive integers, approximates arbitrarily well the class of Borel measurable functions defined on R^d. This allows us to establish the universal approximation property of Boolean functions f : {0, 1}^b → {0, 1}^k, by simply analyzing the possibility of finding proper mappings ϕ : I_m^d → {0, 1}^b and ψ : {0, 1}^k → I_m
which make η(f) = ψ ∘ f ∘ ϕ a mapping from the class of Boolean functions onto the set of discrete functions. The transitive property in turn allows us to establish that

Theorem 1. The class of Boolean functions f : {0, 1}^b → {0, 1}^k is able to approximate arbitrarily well the set of Borel measurable functions.

It is important to observe that on any finite chain, such as I_m, a natural metric d_c can be introduced, called counter metric, defined as d_c(a, b) = |a − b|, where |a − b| is the length of the subchain connecting a and b. Another simple metric, which can be defined on any set, is the flat metric d_f, given by d_f(a, b) = 0 if a = b and 1 otherwise. Note that the counter metric represents a usual definition for measuring distances between the values of a (discrete) ordered variable, whereas the flat metric is a natural definition for nominal variables. These metrics can easily be generalized to I_m^d by considering the extension

$$d_I(x, z) = \sum_{i=1}^{d} d_i(x_i, z_i)$$

where d_i(x_i, z_i) = d_c(x_i, z_i) or d_i(x_i, z_i) = d_f(x_i, z_i), depending on whether the counter or the flat metric is adopted on I_m.
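As a purely illustrative aside (not part of the original formal treatment), the following Python sketch computes the counter metric, the flat metric, and their component-wise extension d_I on I_m^d; all function names are chosen only for this example.

```python
from typing import Sequence

def counter_metric(a: int, b: int) -> int:
    """Counter metric d_c on a finite chain I_m: length of the subchain joining a and b."""
    return abs(a - b)

def flat_metric(a: int, b: int) -> int:
    """Flat metric d_f: 0 if the two values coincide, 1 otherwise."""
    return 0 if a == b else 1

def extended_metric(x: Sequence[int], z: Sequence[int], ordered: Sequence[bool]) -> int:
    """Extension d_I on I_m^d: sum over the components of d_c (ordered variables)
    or d_f (nominal variables)."""
    return sum(
        counter_metric(xi, zi) if use_counter else flat_metric(xi, zi)
        for xi, zi, use_counter in zip(x, z, ordered)
    )

# Example: first two components ordered, third nominal.
print(extended_metric([1, 4, 2], [3, 4, 5], [True, True, False]))  # 2 + 0 + 1 = 3
```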
3 Approximation Property of Positive Boolean Functions
The following theorem offers a possible way to construct the mapping η to be adopted for establishing the approximation capability of the set L_n of positive Boolean functions.

Theorem 2. Let A be an antichain of the poset ({0, 1}^n, ≤), i.e. for any x, y ∈ A with x ≠ y neither x < y nor x > y holds. Then for every f : A → {0, 1} there is a positive Boolean function g : {0, 1}^n → {0, 1} such that f(x) = g(x) for all x ∈ A.

Proof. Take any f : A → {0, 1} and consider the Boolean function g defined in the following way:

$$g(x) = \begin{cases} f(x) & \text{if } x \in A \\ 1 & \text{if } x > a \text{ for some } a \in A \\ 0 & \text{otherwise} \end{cases}$$

Then g is a positive Boolean function, since g(x) ≤ g(y) whenever x ≤ y.

Consider the mapping P : {0, 1}^n → 2^{I_n} defined as P(a) = {i ∈ I_n : a_i = 1} for any a ∈ {0, 1}^n; it produces the subset of indices of the components a_i assuming value 1. It can be easily seen that the mapping P is an isomorphism between the posets ({0, 1}^n, ≤) and (2^{I_n}, ⊆). The inverse of P will be denoted with p; for any subset J ⊆ I_n it gives the element p(J) ∈ {0, 1}^n whose i-th component p_i(J) has value 1 if and only if i ∈ J.

Now, consider the subsets Q_n^l = {a ∈ {0, 1}^n : |P(a)| = l} containing the strings of n bits with l 1s. It can be easily verified that every Q_n^l is an antichain with C_{n,l} = \binom{n}{l} elements.
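To make the construction used in the proof of Th. 2 concrete, the following Python sketch (an illustration only; all names are hypothetical and not taken from the paper) builds the positive extension g from the values of f on an antichain A, implements the mappings P and p, and checks positivity exhaustively for a small n.

```python
from itertools import product

def leq(x, y):
    """Standard ordering on {0,1}^n: x <= y componentwise."""
    return all(xi <= yi for xi, yi in zip(x, y))

def P(a):
    """Isomorphism {0,1}^n -> 2^{I_n}: indices (1-based) of the components equal to 1."""
    return frozenset(i + 1 for i, ai in enumerate(a) if ai == 1)

def p(J, n):
    """Inverse of P: binary string whose i-th component is 1 iff i belongs to J."""
    return tuple(1 if i + 1 in J else 0 for i in range(n))

def positive_extension(f_on_A):
    """Extension of f (a dict on an antichain A) as in the proof of Th. 2:
    g(x) = f(x) on A, g(x) = 1 if x > a for some a in A, g(x) = 0 otherwise."""
    def g(x):
        if x in f_on_A:
            return f_on_A[x]
        if any(leq(a, x) and a != x for a in f_on_A):
            return 1
        return 0
    return g

# Example: n = 4, antichain Q_4^2 (all strings with two 1s), arbitrary values of f.
n = 4
A = [x for x in product((0, 1), repeat=n) if sum(x) == 2]
f_on_A = {a: (i % 2) for i, a in enumerate(A)}
g = positive_extension(f_on_A)

# g agrees with f on A and is positive: g(x) <= g(y) whenever x <= y.
assert all(g(a) == f_on_A[a] for a in A)
assert all(g(x) <= g(y) for x in product((0, 1), repeat=n)
           for y in product((0, 1), repeat=n) if leq(x, y))
```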
Thus, by Th. 2, the approximation capability of the class L_n can be established by considering proper mappings ζ having Q_n^l as codomain for the construction of the mapping ϕ to be employed in η.

Through the isomorphism P, the antichain Q_n^l can be identified with the collection of the subsets of I_n with cardinality l:

$$Q_n^l = \{\{j_1, j_2, \ldots, j_l\} : 1 \le j_1 < j_2 < \cdots < j_l \le n,\ j_i \in I_n \text{ for } i = 1, \ldots, l\}$$

When the lexicographic ordering is used to compare the relative position of its elements (ordered in an increasing way), Q_n^l becomes a chain. Consequently, the counter metric d_c can be defined on Q_n^l, where d_c(B, C) is the length of the subchain connecting B and C for every B, C ∈ Q_n^l. According to this definition, if B = {1, 2, . . . , h − 1, j_h − 1, j_{h+1}, . . . , j_l} and C = {1, 2, . . . , h − 1, j_h, j_{h+1}, . . . , j_l}, with h < j_h < j_{h+1}, we have

$$d_c(B, C) = \binom{j_h - 2}{h - 1} \qquad (1)$$

since the length of the subchain connecting B and C is given by the number of different ways of choosing the h − 1 indices j_1, j_2, . . . , j_{h−1} in the subset {1, 2, . . . , j_h − 2} having cardinality j_h − 2. By iterating the application of (1), if A = {1, 2, . . . , h − 1, h, j_{h+1}, . . . , j_l} we obtain that

$$d_c(A, C) = \sum_{i=h}^{j_h - 1} \binom{i - 1}{h - 1} = \sum_{i=0}^{j_h - h - 1} \binom{i + h - 1}{h - 1} = \binom{j_h - 1}{h}$$

having used the identity

$$\sum_{\nu=0}^{r} \binom{\nu + k - 1}{k - 1} = \binom{r + k}{k}$$

which holds for all positive integers r, k. Note that d_c(A, C) is the length of the subchain covered when moving the h-th element from h to j_h. The least element of Q_n^l is A = {1, 2, . . . , l}, and the distance from the set A to a generic set B = {j_1, j_2, . . . , j_l}, with j_1 < j_2 < · · · < j_l, can be obtained by summing up the lengths of the subchains covered when moving the h-th element from h to j_h, for h = 1, . . . , l. This amounts to

$$d_c(A, B) = \sum_{h=1}^{l} \binom{j_h - 1}{h}$$
Now, if m = C_{n,l}, define the mapping ω : Q_n^l → I_m as ω(B) = 1 + d_c(A, B); the following result is readily proved.

Theorem 3. If m = C_{n,l}, the mapping ω is a bijection and an isometry. Consequently, ω is an isomorphism and Q_n^l is isomorphic to I_m.
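A small Python sketch of the distance d_c(A, B) and of the mapping ω may help to clarify the construction; the code is purely illustrative and not part of the original paper.

```python
from math import comb
from itertools import combinations

def dc_from_least(B):
    """Distance d_c(A, B) from the least element A = {1, ..., l} to the subset B,
    given as an increasing sequence (j_1, ..., j_l): sum of binomial(j_h - 1, h)."""
    return sum(comb(j - 1, h) for h, j in enumerate(sorted(B), start=1))

def omega(B):
    """Mapping omega : Q_n^l -> I_m with m = C(n, l); omega(B) = 1 + d_c(A, B)."""
    return 1 + dc_from_least(B)

# Check on a small case that omega is a bijection onto I_m with m = C(n, l).
n, l = 5, 2
subsets = [set(c) for c in combinations(range(1, n + 1), l)]
assert sorted(omega(B) for B in subsets) == list(range(1, comb(n, l) + 1))

# Worked example: B = {2, 5} gives d_c(A, B) = C(1, 1) + C(4, 2) = 1 + 6 = 7, so omega(B) = 8.
print(dc_from_least({2, 5}), omega({2, 5}))  # 7 8
```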
Since the mapping ω is 1-1, its inverse ω^{−1} : I_m → Q_n^l is uniquely determined when m ≤ C_{n,l}. According to Th. 3, ω^{−1} is 1-1 and an isometry; if in addition m = C_{n,l}, then ω^{−1} turns out to be a bijection. Consider the mirror mapping µ : {0, 1}^n → {0, 1}^n, which reverses the bits of a binary string x. The composition of µ, p and ω^{−1} generates a function γ : I_m → Q_n^l that maps the first m integers into the binary strings of the antichain Q_n^l. Since µ, p and ω^{−1} are all 1-1, so is the composition γ = µ ∘ p ∘ ω^{−1}.

It can be shown that the mapping γ, called henceforth lattice coding, is an isometry between I_m and Q_n^l if the counter metric is employed both on I_m and on Q_n^l. In addition, γ is order preserving if the lexicographic ordering is employed on Q_n^l. Moreover, it can be easily seen that γ is still an isometry when the flat metric is adopted both on I_m and on Q_n^l. Finally, γ is a bijection if m = C_{n,l}. Note that the only one coding (the classical one-hot code), widely used to code nominal variables, is equivalent to a lattice coding with n = m and l = 1. Thus, it is always a bijection, an isometry and an order preserving mapping.

Now, the approximation property of positive Boolean functions can be derived by properly defining the mapping ϕ : I_m^d → Q_n^l. If the elements a of Q_n^l are subdivided into d substrings a^{(i)}, the mapping ϕ can be obtained by concatenating the d binary strings produced by as many functions γ_i : I_m → Q_{n_i}^{l_i}. By Th. 2 every function f : Q_n^l → {0, 1} can be extended to a positive Boolean function; thus, we can conclude that the mapping η defined on L_n as η(f) = ψ ∘ f ∘ ϕ, where ψ is the inverse of γ restricted to the subset γ(I_m), is onto the set of discrete functions. We have then established the following general result:

Theorem 4. The class of positive Boolean functions is able to approximate arbitrarily well the set of Borel measurable functions.
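The following Python sketch illustrates, under the notation above, how the lattice coding γ can be computed by unranking an integer of I_m into a subset of I_n with l elements and then applying p and the mirror mapping µ. The unranking loop is an assumption of this example (one standard way to invert the binomial-sum formula for d_c), not a procedure taken from the paper.

```python
from math import comb

def omega_inverse(r, n, l):
    """Inverse of omega: the subset B of I_n with |B| = l such that
    1 + sum_h C(j_h - 1, h) = r; requires 1 <= r <= C(n, l)."""
    remaining = r - 1
    B = []
    for h in range(l, 0, -1):
        j = h
        # largest j with C(j - 1, h) <= remaining
        while j + 1 <= n and comb(j, h) <= remaining:
            j += 1
        B.append(j)
        remaining -= comb(j - 1, h)
    return sorted(B)

def p(J, n):
    """Binary string of length n whose i-th bit is 1 iff i is in J (1-based indices)."""
    return tuple(1 if i + 1 in set(J) else 0 for i in range(n))

def mirror(x):
    """Mirror mapping mu: reverses the bits of a binary string."""
    return tuple(reversed(x))

def lattice_coding(r, n, l):
    """Lattice coding gamma = mu . p . omega^{-1}: maps r in I_m into a string of Q_n^l."""
    return mirror(p(omega_inverse(r, n, l), n))

# With n = m and l = 1 the lattice coding reduces to the one-hot ("only one") code.
print([lattice_coding(r, 4, 1) for r in range(1, 5)])
# With n = 4 and l = 2, the six integers of I_6 are mapped onto the strings with two 1s.
print([lattice_coding(r, 4, 2) for r in range(1, 7)])
```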
References

1. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2 (1989) 359–366
2. Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural Computation 3 (1991) 246–257
3. Hammer, B., Gersmann, K.: A note on the universal approximation capability of support vector machines. Neural Processing Letters 17 (2003) 43–53
4. Boros, E., Hammer, P.L., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of Logical Analysis of Data. IEEE Transactions on Knowledge and Data Engineering 12 (2000) 292–306
5. Muselli, M., Liberati, D.: Binary rule generation via Hamming Clustering. IEEE Transactions on Knowledge and Data Engineering 14 (2002) 1258–1268
6. Muselli, M.: Switching neural networks: A new connectionist model for classification. Accepted at WIRN '05 - XVI Italian Workshop on Neural Networks (Vietri sul Mare, Italy, 2005)
7. Muselli, M., Quarati, A.: Reconstructing positive Boolean functions with Shadow Clustering. Accepted at ECCTD '05 - 17th European Conference on Circuit Theory and Design (Cork, Ireland, August 2005)