Boolean Reasoning for Feature Extraction Problems
Hung Son Nguyen, Andrzej Skowron
Institute of Mathematics, Warsaw University, Banacha 2, Warsaw, Poland
email: [email protected]; [email protected]
Abstract. We recall several applications of Boolean reasoning for feature extraction and we propose an approach based on Boolean reasoning for new feature extraction from data tables with symbolic (nominal, qualitative) attributes. New features are of the form $a \in V$, where $V \subseteq V_a$ and $V_a$ is the set of values of attribute $a$. We emphasize that Boolean reasoning is also a good framework for complexity analysis of the approximate solutions of the discussed problems.
1 Introduction
"Feature Extraction" and "Feature Selection" are important problems in Machine Learning and Data Mining (see e.g. [6, 3, 4]). In previous papers we have considered problems such as the short reduct finding problem [16], the rule induction problem [17], the optimal discretization problem [12], and the linear feature (hyperplane) searching problem [13]. Our solutions of these problems are based on the Boolean reasoning schema [2]. In this paper we discuss the problem of searching for new features in a data table with symbolic (qualitative) values of attributes. This problem, called the symbolic value partition problem, differs from the discretization problem: we do not assume any predefined order on the values of attributes. Once again, we apply the rough set method and Boolean reasoning to construct heuristics searching for relevant features of the form $a \in V \subseteq V_a$, generated by partitions of symbolic values of conditional attributes into a small number of value sets.
We also point out that Boolean reasoning can be used as a tool to measure the complexity of approximate solutions of a given problem. As a complexity measure of a given problem we propose the complexity of the Boolean function corresponding to that problem (represented by the number of variables, number of clauses, etc.). It is known that for some NP-hard problems it is easier to construct efficient heuristics than for others. In this sense the symbolic value partition problem is harder than the optimal discretization problem.
2 Preliminaries
We consider the Boolean algebra over $B = \{0, 1\}$ and $n$-variable Boolean functions $f : B^n \to B$, where $n \geq 1$.
For any sequence $a = (a[1], \dots, a[n]) \in B^n$ and any vector of Boolean variables $x = (x_1, \dots, x_n)$ we define the minterm $m_a$ and the maxterm $s_a$ by
$$m_a(x) = x_1^{a[1]} \wedge x_2^{a[2]} \wedge \dots \wedge x_n^{a[n]} \quad \text{and} \quad s_a(x) = x_1^{\neg a[1]} \vee x_2^{\neg a[2]} \vee \dots \vee x_n^{\neg a[n]},$$
where $x^1 = x$ and $x^0 = \neg x$.
Theorem 1 (see [18]).
$$f(x) = \bigvee_{a \in f^{-1}(1)} m_a(x) = \bigwedge_{b \in f^{-1}(0)} s_b(x)$$
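Theorem 1 can be checked mechanically on a small function. The following is a minimal sketch, not part of the original text: the helper names `minterm`, `maxterm`, `dnf`, `cnf` and the example function `majority` are assumptions introduced only for illustration. It evaluates both normal forms by enumerating $f^{-1}(1)$ and $f^{-1}(0)$.

```python
from itertools import product


def minterm(a, x):
    """m_a(x): the literal x_i^{a[i]} is 1 iff x_i == a[i], so m_a(x) = 1 iff x == a."""
    return int(all(xi == ai for xi, ai in zip(x, a)))


def maxterm(a, x):
    """s_a(x): the literal x_i^{not a[i]} is 1 iff x_i != a[i], so s_a(x) = 0 iff x == a."""
    return int(any(xi != ai for xi, ai in zip(x, a)))


def dnf(f, n, x):
    """Disjunction of the minterms m_a over all a with f(a) = 1."""
    return int(any(minterm(a, x) for a in product((0, 1), repeat=n) if f(a)))


def cnf(f, n, x):
    """Conjunction of the maxterms s_b over all b with f(b) = 0."""
    return int(all(maxterm(b, x) for b in product((0, 1), repeat=n) if not f(b)))


def majority(x):
    """Hypothetical example function: 1 iff at least two of three inputs are 1."""
    return int(sum(x) >= 2)
```

For every $x \in B^3$, `dnf(majority, 3, x)` and `cnf(majority, 3, x)` both agree with `majority(x)`. The enumeration is exponential in $n$ and is meant only to illustrate the definitions.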
The two representations in Theorem 1 are called the disjunctive (DNF) and conjunctive (CNF) normal forms of the function $f$, respectively. Let $u = (u_1, \dots, u_n), v = (v_1, \dots, v_n) \in \{0, 1\}^n$. We use the coordinate-wise ordering, i.e. $u \leq v$ if and only if $u_i \leq v_i$ for all $i$. A Boolean function $f$ is called monotone iff $u \leq v$ implies $f(u) \leq f(v)$. One can show that a Boolean function is monotone if and only if it can be defined without negation [18]. Given a set of variables $S \subseteq \{x_1, \dots, x_n\}$ we define the monomial $m_S$ by
$$m_S(x) = \bigwedge_{x_i \in S} x_i.$$
The set $S$ of variables is called an implicant of the monotone Boolean function $f$ if and only if $m_S^{-1}(1) \subseteq f^{-1}(1)$. The set $S$ of variables is called a prime implicant of a monotone Boolean function $f$ if $S$ is an implicant of $f$ and no proper subset of $S$ is an implicant of $f$. We use the following properties of two problems related to monotone Boolean functions [2]:
Theorem 2 [12]. Let $f$ be a monotone Boolean function of $n$ variables in CNF and let $k$ be an integer. The decision problem of checking whether there exists a prime implicant of $f$ with at most $k$ variables is NP-complete. The problem of searching for a minimal prime implicant of $f$ is NP-hard.
An information system [15] is a pair $\mathcal{A} = (U, A)$, where $U$ is a non-empty, finite set called the universe and $A$ is a non-empty, finite set of attributes, i.e. $a : U \to V_a$ for $a \in A$, where $V_a$ is called the value set of $a$. Elements of $U$ are called objects. Any information system $\mathcal{A} = (U, A)$ and a non-empty set $B \subseteq A$ define a $B$-information function by $Inf_B(x) = \{(a, a(x)) : a \in B\}$ for $x \in U$. The set $\{Inf_A(x) : x \in U\}$ is called the $A$-information set and denoted by $INF(\mathcal{A})$.
Any information system of the form $\mathcal{A} = (U, A \cup \{d\})$ is called a decision table, where $d \notin A$ is called the decision and the elements of $A$ are called conditions. Let $V_d = \{1, \dots, r(d)\}$. The decision $d$ determines the partition $\{C_1, \dots, C_{r(d)}\}$ of the universe $U$, where $C_k = \{x \in U : d(x) = k\}$ for $1 \leq k \leq r(d)$. The set $C_k$ is called the $k$-th decision class of $\mathcal{A}$.
With any subset of attributes $B \subseteq A$, an equivalence relation called the $B$-indiscernibility relation [15], denoted by $IND(B)$, is defined by
$$IND(B) = \{(x, y) \in U \times U : \forall_{a \in B}\ (a(x) = a(y))\}.$$
Objects $x, y$ satisfying the relation $IND(B)$ are indiscernible by attributes from $B$. By $[x]_{IND(B)}$ we denote the equivalence class of $IND(B)$ defined by $x$. A minimal subset $B$ of $A$ such that $IND(A) = IND(B)$ is called a reduct of $\mathcal{A}$.
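The indiscernibility relation and the notion of a reduct can be made concrete on a tiny table. The sketch below is illustrative only: the table `TABLE`, the attribute names and the functions `ind_classes` and `reducts` are hypothetical, and the reduct search is a plain exhaustive enumeration, not a method proposed in the text.

```python
from itertools import combinations

# Hypothetical toy information system: four objects, three attributes (c = 1 - a).
TABLE = {
    "x1": {"a": 0, "b": 0, "c": 1},
    "x2": {"a": 0, "b": 1, "c": 1},
    "x3": {"a": 1, "b": 0, "c": 0},
    "x4": {"a": 1, "b": 1, "c": 0},
}
ATTRS = ("a", "b", "c")


def ind_classes(table, attrs):
    """Return the partition of the universe into equivalence classes of IND(attrs)."""
    classes = {}
    for obj, row in table.items():
        signature = tuple(row[a] for a in attrs)  # plays the role of Inf_B(x)
        classes.setdefault(signature, set()).add(obj)
    return {frozenset(c) for c in classes.values()}


def reducts(table, attrs):
    """Enumerate all minimal subsets B with IND(B) = IND(A), smallest first."""
    full = ind_classes(table, attrs)
    found = []
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # contains a reduct already found, hence not minimal
            if ind_classes(table, subset) == full:
                found.append(subset)
    return found
```

On this table, attribute `a` alone only separates `{x1, x2}` from `{x3, x4}`, while `reducts(TABLE, ATTRS)` yields `("a", "b")` and `("b", "c")`, since `c` carries the same information as `a`.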
If $\mathcal{A} = (U, A \cup \{d\})$ is a decision table and $B \subseteq A$ then we define a function $\partial_B : U \to 2^{\{1, \dots, r(d)\}}$, called the generalized decision in $\mathcal{A}$, by
$$\partial_B(x) = \{i : \exists_{x' \in U}\ [(x'\, IND(B)\, x) \wedge (d(x') = i)]\} = d\left([x]_{IND(B)}\right).$$
A decision table $\mathcal{A}$ is called consistent (deterministic) if $card(\partial_A(x)) = 1$ for any $x \in U$; otherwise $\mathcal{A}$ is inconsistent (non-deterministic).
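As a small illustration of the generalized decision, the sketch below computes $\partial_B(x)$ for every object of a toy decision table and tests consistency. The table `DECISION_TABLE`, its attribute values and the function names are assumptions made for this example only.

```python
# Hypothetical toy decision table: conditions "a", "b" and decision "d".
DECISION_TABLE = [
    {"a": "red", "b": "small", "d": 1},
    {"a": "red", "b": "small", "d": 2},
    {"a": "blue", "b": "small", "d": 1},
    {"a": "blue", "b": "large", "d": 2},
]


def generalized_decision(rows, attrs, decision="d"):
    """For each object, return the set of decisions occurring in its IND(attrs) class."""
    by_signature = {}
    for row in rows:
        signature = tuple(row[a] for a in attrs)
        by_signature.setdefault(signature, set()).add(row[decision])
    return [frozenset(by_signature[tuple(row[a] for a in attrs)]) for row in rows]


def is_consistent(rows, attrs, decision="d"):
    """A table is consistent iff every generalized decision is a singleton."""
    return all(len(ds) == 1 for ds in generalized_decision(rows, attrs, decision))
```

Here the first two objects are indiscernible by `("a", "b")` but carry different decisions, so their generalized decision is `{1, 2}` and the table is inconsistent; dropping one of them would make it consistent.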
3 Some optimization problems in rough set theory
3.1 Minimal Reduct problem
Let $\mathcal{A}$ be an information system with $n$ objects and $k$ attributes. By $M(\mathcal{A})$ [16] we denote an $n \times n$ matrix $(c_{ij})$, called the discernibility matrix of $\mathcal{A}$, such that
$$c_{ij} = \{a \in A : a(x_i) \neq a(x_j)\} \quad \text{for } i, j = 1, \dots, n.$$
A discernibility function $f_{\mathcal{A}}$ for the information system $\mathcal{A}$ is a Boolean function of $k$ Boolean variables $a_1^*, \dots, a_k^*$ corresponding to the attributes $a_1, \dots, a_k$ respectively, and defined by
$$f_{\mathcal{A}}(a_1^*, \dots, a_k^*) =_{df} \bigwedge_{c_{ij} \neq \emptyset} \bigvee c_{ij}^*, \quad \text{where } c_{ij}^* = \{a^* : a \in c_{ij}\}.$$
The set of all prime implicants of $f_{\mathcal{A}}$ determines the set of all reducts of $\mathcal{A}$ [16]. In the sequel, to simplify the notation, we omit the star superscripts. Observe that the Boolean function $f_{\mathcal{A}}$ consists of $k$ variables and $O(n^2)$ clauses.
A subset $B$ of the set $A$ of attributes of a decision table $\mathcal{A} = (U, A \cup \{d\})$ is a relative reduct of $\mathcal{A}$ iff $B$ is a minimal set with respect to the following property: $\partial_B = \partial_A$. The set of all relative reducts in $\mathcal{A}$ is denoted by $RED(\mathcal{A}, d)$.
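The correspondence between prime implicants of the discernibility function and reducts can be traced on a small example. The following sketch is an illustration under stated assumptions: the table `OBJECTS` and the function names are hypothetical, and the prime implicants are found by exhaustive search over attribute subsets, which is feasible only for very small $k$.

```python
from itertools import combinations

# Hypothetical toy information system with attributes a, b, c (c = 1 - a).
OBJECTS = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 0},
]
ATTRS = ("a", "b", "c")


def discernibility_clauses(objects, attrs):
    """Collect the distinct non-empty entries c_ij (i < j) of the discernibility matrix."""
    clauses = set()
    for x, y in combinations(objects, 2):
        entry = frozenset(a for a in attrs if x[a] != y[a])
        if entry:
            clauses.add(entry)
    return clauses


def prime_implicants(clauses, attrs):
    """Minimal attribute sets meeting every clause of the monotone CNF, smallest first."""
    found = []
    for size in range(len(attrs) + 1):
        for subset in combinations(attrs, size):
            candidate = frozenset(subset)
            if any(p <= candidate for p in found):
                continue  # contains a smaller implicant, hence not prime
            if all(candidate & clause for clause in clauses):
                found.append(candidate)
    return found
```

For this table the discernibility function is $b \wedge (a \vee c) \wedge (a \vee b \vee c)$, and `prime_implicants(discernibility_clauses(OBJECTS, ATTRS), ATTRS)` returns `{a, b}` and `{b, c}`, which are exactly the attribute sets that still discern all four objects and cannot be shrunk.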
Theorem 3 [16]. The decision problem of checking whether there exists a (relative) reduct of length $< k$ is NP-complete. The problem of searching for a reduct of minimal length is NP-hard.
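Because of Theorem 3, short reducts are in practice sought with heuristics rather than exact search. The sketch below uses the standard greedy set-cover strategy on the clauses of a discernibility function; it is an assumption chosen for illustration and is not claimed to be the heuristic of [16]. The function name `greedy_hitting_set` and the example clauses are hypothetical, and the result is inclusion-minimal but not guaranteed to be of minimal length.

```python
def greedy_hitting_set(clauses):
    """Greedily pick attributes until every non-empty clause contains a chosen one."""
    clauses = [set(c) for c in clauses if c]
    remaining = list(clauses)
    chosen = []
    while remaining:
        counts = {}
        for clause in remaining:
            for attr in clause:
                counts[attr] = counts.get(attr, 0) + 1
        # Most frequent attribute first; ties go to the alphabetically smallest name.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        remaining = [c for c in remaining if best not in c]
    # Pruning pass: drop any attribute whose removal still leaves all clauses hit.
    for attr in list(chosen):
        rest = [a for a in chosen if a != attr]
        if all(any(a in c for a in rest) for c in clauses):
            chosen = rest
    return set(chosen)
```

For the clauses `{a, b}`, `{b, c}`, `{c, d}`, `{d}` the sketch selects `b` and then `d`, and neither can be dropped, so it returns `{b, d}`. Each iteration scans all remaining clauses, so the cost is polynomial in the number of clauses and attributes.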
3.2 Discretization making
Let $\mathcal{A} = (U, A \cup \{d\})$ be a decision table where $U = \{x_1, x_2, \dots, x_n\}$, $A = \{a_1, \dots, a_k\}$ and $d : U \to \{1, \dots, r\}$. We assume $V_a = [l_a, r_a) \subset \mathbb{R}$ to be a real interval for any $a \in A$ and $\mathcal{A}$ to be a consistent decision table. Any pair $(a, c)$ where $a \in A$ and $c \in \mathbb{R}$ will be called a cut on $V_a$. Let $P_a$ be a partition on $V_a$ (for $a \in A$) into subintervals, i.e. $P_a = \{[c_0^a, c_1^a), [c_1^a, c_2^a), \dots, [c_{k_a}^a, c_{k_a+1}^a)\}$ for some integer $k_a$, where $l_a = c_0^a < c_1^a < c_2^a < \dots < c_{k_a}^a < c_{k_a+1}^a = r_a$ and $V_a = [c_0^a, c_1^a) \cup [c_1^a, c_2^a) \cup \dots \cup [c_{k_a}^a, c_{k_a+1}^a)$. Hence any partition $P_a$ is uniquely defined by, and often identified with, the set of cuts $\{(a, c_1^a), (a, c_2^a), \dots, (a, c_{k_a}^a)\} \subset A$