Information Sciences 280 (2014) 98–110. http://dx.doi.org/10.1016/j.ins.2014.04.042

On establishing nonlinear combinations of variables from small to big data for use in later processing

Jerry M. Mendel*, Mohammad M. Korjani
Signal and Image Processing Institute, Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2564, United States

Article history: Received 29 December 2013; Received in revised form 29 March 2014; Accepted 25 April 2014; Available online 9 May 2014

Keywords: Big data; Causal combination; Fast processing; Nonlinear combination; Parallel and distributed processing; Preprocessing

Abstract

This paper presents a very efficient method for establishing nonlinear combinations of variables from small to big data for use in later processing (e.g., regression, classification, etc.). Variables are first partitioned into subsets, each of which has a linguistic term (called a causal condition) associated with it. Our Causal Combination Method uses fuzzy sets to model the terms and focuses on interconnections (causal combinations) of either a causal condition or its complement, where the connecting word is AND, which is modeled using the minimum operation. Our Fast Causal Combination Method is based on a novel theoretical result, leads to an exponential speedup in computation and lends itself to parallel and distributed processing; hence, it may be used on data from small to big.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Suppose one is given data of any size, from small to big, for a group of v input variables that one believes caused^1 an output, and that one does not know which (nonlinear) combinations of the input variables caused the output. This paper presents a very efficient method for establishing the initial (nonlinear) combinations of variables that can then be used in later modeling and processing. For example, in nonlinear regression (e.g., [21,22]) one needs to choose the nonlinear interactions among the variables as well as the number of terms in the regression model;^2 in pattern classification (e.g., [7,2]) that is based on mathematical features (e.g., [23]) one needs to choose the nonlinear nature of those features as well as the number of such features; and in some neural networks (e.g., [11]) one needs to know which combinations of the inputs, and how many such combinations, should be fanned out to one or more of the network's various layers.

Our Causal Combination Method (CCM), described in Section 3, provides the initial combinations of the variables as well as their number, and can also be used in later processing to readjust the combinations of the variables, as well as their number, that are used in a model. Our Fast Causal Combination Method (FCCM), also described in Section 3, is a very efficient way of implementing CCM for data of any size.

Establishing which combinations of variables to use in a model can be interpreted as a form of data preprocessing.

^1 How to choose the variables is crucial to the success of any model. This paper assumes that the user has already established the variables that (may) affect the outcome.
^2 According to [5, p. 20], ". . . in practice, due to complex and often informal nature of a priori knowledge, . . . specification of approximating functions may be difficult or impossible."



According to [24]: "Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing." According to [13], data preprocessing includes cleaning, normalization, transformation, feature extraction and selection. Our preprocessing is about the transformation of raw data into patterns.

CCM focuses on interconnections of either a causal condition (defined in Section 2) or its complement, where the connecting word is AND, which is modeled using the minimum operation. Note that, because you might be wrong about postulating a cause, you protect yourself against this by considering both the cause and its complement (an idea that was first suggested^3 by Ragin in [19, p. 131]). The interconnection of either a causal condition or its complement for all of the v variables is called a causal combination. As will be seen in Section 2, there can be a (very) large number of candidate causal combinations. CCM prunes the (very) large number of candidate causal combinations (using data about the v input variables) to a much smaller subset of surviving causal combinations, and FCCM does this in a very efficient way. These surviving causal combinations are the nonlinear combinations of the input variables that can then be used in later modeling or processing.

In summary, this paper presents a very efficient method for establishing the initial (nonlinear) combinations of variables that can then be used in later modeling and processing, by using a novel form of preprocessing that transforms raw data into patterns through the use of fuzzy sets. We will show that our method lends itself to massive distributed and parallel processing, which makes it suitable for data of all sizes, from small to big.

The rest of this paper is organized as follows: Section 2 describes the terminology and approach that is used in the rest of the paper; Section 3 provides the main results for CCM and FCCM; Section 4 quantifies the computational speedup for FCCM; Section 5 provides some additional ways to speed up FCCM; and Section 6 draws conclusions.

2. Terminology and approach

A data pair is (x(t), y(t)), where x(t) = col(x_1(t), . . . , x_v(t)), x_i(t) is the ith input variable and y(t) is the output for that x(t). Each data pair is treated as a "case;" index t denotes a data case, and there does not have to be a unique natural ordering of the cases over t (in a multi-variable approximation application there is no natural ordering of the data cases, but in a time-series forecasting application the data cases would have a natural temporal ordering). We assume that N data pairs are available, and refer to the collection of these data pairs as S_Cases, where:

S_{Cases} = \{(\mathbf{x}(t), y(t))\}_{t=1}^{N}    (1)

For Big Data [10, Table I], N ranges from huge (O(N) = 10^10) to monster (O(N) = 10^12) to very large (O(N) > 10^12).

We begin by partitioning each input variable into subsets, each of which may be thought of as having a linguistic term^4 associated with it; e.g., the variable Pressure can be partitioned into Low Pressure, Moderate Pressure and High Pressure. Because it is very difficult to know where to draw a crisp line between each of the subsets, so as to separate one from the other, they are modeled herein as fuzzy sets, and there can be from 1 to n_v subsets (terms) for each input variable. The terms for each input variable that are actually used in CCM are called causal conditions.

If one chooses to use only one term for each variable (e.g., Profitable Company, Educated Country, Permeable Oil Field, etc.), then the words "variable," "term" and "causal condition" can be interchanged, i.e., they are synonymous. If, on the other hand, one chooses to use more than one term for each variable (e.g., Low Pressure, Moderate Pressure and High Pressure), i.e., to granulate [1] each variable, as is very commonly done in engineering and computer science applications, then one must distinguish between the words "variable," "term" and "causal condition." We elaborate further on this next.

If, e.g., there are V variables, each described by n_v terms (v = 1, . . . , V), then (as in fsQCA [16,17,19,20]) each of the terms will be treated as a separate causal condition^5 (this is illustrated below in Example 1), and, as a result, there will be k = n_1 + n_2 + · · · + n_V causal conditions. We let ξ_v (v = 1, . . . , V) denote a variable and T_l(ξ_v) (l = 1, . . . , n_v) denote the terms for each variable. For simplicity, in this paper the same number of terms is used for each variable, i.e., n_v = n_c for all v, so that the total number of causal conditions is k = n_cV. The terms are organized according to the (non-unique) ordering of the V input variables, as {T_1(ξ_1), . . . , T_{n_c}(ξ_1), . . . , T_1(ξ_V), . . . , T_{n_c}(ξ_V)}. This set of n_cV terms is then mapped into an ordered set of possible causal conditions, S_C', as follows:

\{T_1(\xi_1), \ldots, T_{n_c}(\xi_1), \ldots, T_1(\xi_V), \ldots, T_{n_c}(\xi_V)\} \to \{C'_1(\xi_1), \ldots, C'_{n_c}(\xi_1), \ldots, C'_{n_c(V-1)+1}(\xi_V), \ldots, C'_{n_cV}(\xi_V)\} \equiv \{C'_1, \ldots, C'_{n_c}, \ldots, C'_{n_c(V-1)+1}, \ldots, C'_{n_cV}\}    (2)

^3 Traditional interconnections usually do not consider both a cause and its complement; in fact, one almost never sees the complement of a cause in an interconnection of causes (e.g., in the antecedents of either a crisp or fuzzy rule).
^4 The actual names that are given to the subsets are not important for this paper, e.g., they may be given linguistically meaningful names (as in our example of Pressure) or symbolic names (e.g., A, B, C; T1, T2, T3; etc.).
^5 One may raise an objection to doing this because of perceived correlations between terms for the same variable (e.g., perhaps Low Pressure and Moderate Pressure are highly correlated, or Moderate Pressure and High Pressure are highly correlated). Such perceptions depend on how overlapped the fuzzy sets are for the terms, and do not have to be accounted for during CCM, because the mathematics for CCM will take care of the overlap automatically.


A subset of the C'_i possible causal conditions, S_C, chosen (by the end-user) from S_C', forms the finite space of causal conditions, where S_C ≡ {C_i ∈ S_C', i = 1, . . . , n_cV}. It is not uncommon for one to choose S_C = S_C'. A causal combination is a connection of k = n_cV conditions, each of which is either a causal condition or its complement for each variable. We treat the causal conditions as fuzzy sets and determine membership functions (MFs) for them ([14,19] provide ways to obtain the MFs; Fuzzy c-Means (FCM) can also be used to do this [3], and to perform FCM for very large data, see [10]; however, the exact way in which these MFs are obtained is not needed in the rest of this paper), i.e.

\mu_{C_l}\colon \Xi_v \subset \mathbb{R} \to [0, 1],\quad \xi_v \mapsto \mu_{C_l}(\xi_v), \qquad l = 1, \ldots, n_c,\ v = 1, \ldots, V    (3a)

which can be expressed as:

\mu_{C_l}(\xi_v) \to \{\mu_{C_i} \mid i = 1, \ldots, k = n_cV\}    (3b)

These MFs are evaluated for all N cases,^6 the results being called derived MFs, i.e. (t = 1, . . . , N and i = 1, . . . , k)

\mu^D_{C_i}\colon (S_{Cases}, S_C) \to [0, 1],\quad t \mapsto \xi_i(t) \mapsto \mu_{C_i}(\xi_i(t)) \equiv \mu^D_{C_i}(t)    (4)
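To make (3) and (4) concrete, the following Python sketch (ours, not from the paper) evaluates illustrative term MFs over a batch of cases to produce the derived-MF values μ^D_{C_i}(t). The trapezoidal MF shapes and the Pressure breakpoints are assumptions chosen only for illustration; the paper leaves the choice of MFs open.

```python
import numpy as np

def trap(x, a, b, c, d):
    """Trapezoidal MF: 1 on [b, c], linear ramps on [a, b] and [c, d], else 0.
    b <= a (d <= c) yields a left (right) shoulder MF."""
    x = np.asarray(x, dtype=float)
    left = np.ones_like(x) if b <= a else np.clip((x - a) / (b - a), 0.0, 1.0)
    right = np.ones_like(x) if d <= c else np.clip((d - x) / (d - c), 0.0, 1.0)
    return np.minimum(left, right)

# Illustrative granulation of one variable, Pressure, into n_c = 3 terms
# (causal conditions); the breakpoints are assumptions, not values from the paper.
term_mfs = [
    lambda x: trap(x, 0, 0, 20, 40),      # Low Pressure (left shoulder)
    lambda x: trap(x, 20, 40, 60, 80),    # Moderate Pressure (interior)
    lambda x: trap(x, 60, 80, 100, 100),  # High Pressure (right shoulder)
]

pressure = np.array([12.0, 35.0, 55.0, 72.0, 95.0])   # xi(t) for N = 5 cases
mu_D = np.vstack([mf(pressure) for mf in term_mfs])   # Eq. (4): shape (n_c, N)
print(np.round(mu_D, 2))
```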

Next, we^7 conceptually postulate 2^k (k = n_cV) candidate causal combinations. Let S_F be the finite space of 2^k candidate causal combinations, F_j, i.e.

S_F = \{F_1, \ldots, F_{2^k}\} \ni F_j = A^j_1 \wedge A^j_2 \wedge \cdots \wedge A^j_k,\quad A^j_i = C_i \text{ or } c_i \qquad (j = 1, \ldots, 2^k \text{ and } i = 1, \ldots, k)    (5)

where c_i denotes the complement of C_i, and ∧ denotes conjunction and is modeled using minimum. Our preprocessing that is described in Section 3 focuses on reducing the number of elements in S_F from 2^k, which can be very large, to a much smaller number.

Example 1. Regarding the allocation of money in a number of investment alternatives [18, Chapter 7], let x_1 = risk of losing capital, x_2 = vulnerability to inflation, x_3 = amount of profit received, and x_4 = liquidity. In this example, it is assumed that x_1, x_2 and x_3 are each described by the same three terms: Low (L), Moderate (M) and High (H), whereas x_4 is described by Bad (B), Fair (F) and Good (G). Because V = 4 and n_c = 3, k = n_cV = 12, and:

{T_1(x_1), . . . , T_3(x_1), . . . , T_1(x_4), . . . , T_3(x_4)} = {Low(x_1), Moderate(x_1), High(x_1), Low(x_2), Moderate(x_2), High(x_2), Low(x_3), Moderate(x_3), High(x_3), Bad(x_4), Fair(x_4), Good(x_4)} ≡ {C_1, C_2, C_3, . . . , C_10, C_11, C_12}

There will be 2^12 = 4096 candidate causal combinations, each with 12 terms that are connected by AND, one example of which is:^8

F = (L_{x_1} \wedge m_{x_1} \wedge h_{x_1}) \wedge (l_{x_2} \wedge M_{x_2} \wedge h_{x_2}) \wedge (l_{x_3} \wedge m_{x_3} \wedge h_{x_3}) \wedge (b_{x_4} \wedge f_{x_4} \wedge G_{x_4})

3. Main results for the Causal Combination Method (CCM) and the Fast Causal Combination Method (FCCM)

In this section we present the details for the CCM and FCCM.

3.1. Causal Combination Method (CCM)

In CCM one actually creates the 2^k (k = n_cV) candidate causal combinations, computes each of their MFs in all of the N cases, and then keeps only the ones (the R_S surviving causal combinations) whose MF values are >0.5 (our reason for doing this is explained below in Comment 1) for an adequate number of cases (which must be specified by the user). This is a mapping from {S_F, S_Cases} into S_{F^S} that makes use of μ_{A^j_i}(t), where S_{F^S} is a subset of S_F, with R_S elements, i.e. (j = 1, . . . , 2^k, t = 1, . . . , N and l = 1, . . . , R_S) [16]

^6 In Section 6 we describe how this can be done using parallel or distributed processing.
^7 In CCM one must actually enumerate (create) all of the 2^k candidate causal combinations. As we show below, a distinguishing feature of FCCM is that we only have to conceptually postulate them.
^8 How to simplify an interconnection of three terms for the same variable is not needed for CCM or FCCM, but may be something that someone is interested in doing at the very end of their final processing, in order to make such an interconnection more linguistically interpretable. Discussions on how to do this that are based on the similarity of fuzzy sets can be found in [14].


\mu_{F_j}\colon (S_F, S_{Cases}) \to [0, 1],\quad t \mapsto \mu_{F_j}(t) = \min\{\mu_{A^j_1}(t), \mu_{A^j_2}(t), \ldots, \mu_{A^j_k}(t)\}    (6)

\mu_{A^j_i}(t) = \mu^D_{C_i}(t) \text{ or } \mu^D_{c_i}(t) = 1 - \mu^D_{C_i}(t), \qquad i = 1, \ldots, k    (7)

t_{F_j}\colon ([0, 1], S_{Cases}) \to \{0, 1\},\quad t \mapsto t_{F_j}(t) = \begin{cases} 1 & \text{if } \mu_{F_j}(t) > 0.50 \\ 0 & \text{if } \mu_{F_j}(t) \le 0.50 \end{cases}    (8)

N_{F_j}\colon \{0, 1\} \to I,\quad t_{F_j} \mapsto N_{F_j} = \sum_{t=1}^{N} t_{F_j}(t)    (9)

F^S_l\colon (S_F, I) \to S_{F^S},\quad F_j \mapsto F^S_l = \{F_{j'}(j' \to l) \mid N_{F_{j'}} \ge f,\ j' = 1, \ldots, 2^k\}    (10)

where F_{j'}(j' → l) means that F_{j'} is added to the set of surviving causal combinations as F^S_l, and l is the index of the surviving set. In (10), f is an integer frequency threshold that must be set by the user (see Comment 2 below). The surviving causal combinations are denoted F^S_l, with associated re-numbered MFs μ_{F^S_l}(t), l = 1, . . . , R_S.

Comment 1. Each of the 2^k candidate causal combinations can be interpreted as a corner in a 2^k-dimensional vector space [20, Chapter 5]. Paraphrasing [16, p. 7] and [17, footnote 6]: choosing the surviving causal combinations according to (6)-(10) can be interpreted as keeping the adequately represented causal combinations that are closer to corners, and not the ones that are farther away from corners. Regarding closeness to corners in a 2^k-dimensional vector space: if crisp sets were used instead of fuzzy sets, a candidate causal combination is either fully supported (i.e., its MF value equals 1) or is not supported at all (i.e., its MF value equals 0), and only the fully supported candidate causal combinations survive. Using fuzzy sets lets one back off from the stringent requirement of using crisp sets, by replacing the vertex membership value of "1" with a vertex membership value of >0.5, meaning that if the MF value for a causal combination is greater than 0.5 then the causal combination is closer to its vertex than it is away from its vertex. Only those cases whose causal combination MF values are greater than 0.5 are said to support the existence of a candidate causal combination.

Comment 2. In order to implement (10), threshold f (cut-off frequency) has to be chosen. This choice is arbitrary and depends on the application and on how many cases are available. Discussions on how to choose f are given in [20, Ch. 5, p. 107] and [19, p. 197]. Often f is set equal to the value of N_{F_{j'}} that captures more than 80% of the cases assigned to causal combinations [8]; alternatively, it is sometimes chosen as 1. When the latter is done, it is not uncommon for there to be many cases that are associated with this cut-off frequency. They must all be kept, because as of yet there is no natural way to determine which of these cases is more important than the others.

Comment 3. A brute force way to carry out the CCM computations in (6)-(10) is to create a table in which there are N rows, one for each case, and 2^k = 2^{n_cV} columns, one for each of the causal combinations. The entries in this table are μ_{F_j}(t), and there will be N·2^{n_cV} such entries. Such a table is called a Truth Table by Ragin [19,20]. One then searches through this very large table and keeps only those causal combinations whose MF entries are >0.5. If f = 1, then all such causal combinations, after removing duplications, become the set of R_S surviving causal combinations. For Big Data (and even for not-so-big data), N·2^{n_cV} will be enormous, and so this brute force way to carry out the computations in (6)-(10) is totally impractical. A sketch of this brute-force approach is given below.
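As a concreteness check on Comment 3, here is a minimal Python sketch (ours, not the paper's) of the brute-force CCM of (6)-(10). It takes the (k, N) matrix of derived MFs μ^D_{C_i}(t) (e.g., mu_D from the sketch in Section 2) and a user-chosen threshold f, and it enumerates all 2^k candidate causal combinations, so it is only usable for small k and N.

```python
import itertools
import numpy as np

def ccm_brute_force(mu_D, f=1):
    """Brute-force CCM, Eqs. (6)-(10).

    mu_D: (k, N) array of derived MFs; the complement MF is 1 - mu_D.
    Returns {pattern: N_Fj} for the surviving causal combinations, where a
    pattern entry True/False means C_i/c_i appears in the conjunction.
    """
    k, N = mu_D.shape
    survivors = {}
    for pattern in itertools.product([True, False], repeat=k):  # all 2**k F_j, Eq. (5)
        take_C = np.array(pattern)[:, None]                     # (k, 1) selector
        rows = np.where(take_C, mu_D, 1.0 - mu_D)               # Eq. (7)
        mu_F = rows.min(axis=0)                                 # Eq. (6): min over k conditions
        N_F = int((mu_F > 0.5).sum())                           # Eqs. (8)-(9)
        if N_F >= f:                                            # Eq. (10)
            survivors[pattern] = N_F
    return survivors
```

The work grows as O(2^k · kN), which is exactly the N·2^{n_cV}-entry truth table of Comment 3.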

102

J.M. Mendel, M.M. Korjani / Information Sciences 280 (2014) 98–110

3.2. Fast Causal Combination Method (FCCM)

Ragin [19, p. 131] observed the following in an example with four causal conditions: ". . . each case can have (at most) only a single membership score greater than 0.5 in the logical possible combinations from a given set of causal conditions [i.e., in the candidate causal combinations]." This somewhat surprising result is true in general, and in [17] the following theorem, which locates the one causal combination for each case whose MF is >0.5, was presented:

Theorem 1 (Min–Max Theorem [17]). Given k causal conditions, C_1, C_2, . . . , C_k, and their respective complements, c_1, c_2, . . . , c_k, consider the 2^k candidate causal combinations (j = 1, . . . , 2^k) F_j = A^j_1 ∧ A^j_2 ∧ · · · ∧ A^j_k, where A^j_i = C_i or c_i and i = 1, . . . , k. Let

\mu_{F_j}(t) = \min\{\mu_{A^j_1}(t), \mu_{A^j_2}(t), \ldots, \mu_{A^j_k}(t)\},\quad t = 1, 2, \ldots, N    (11)

Then for each t (case) there is only one j, j*(t), for which \mu_{F_{j*(t)}}(t) > 0.5, and \mu_{F_{j*(t)}}(t) can be computed as:

\mu_{F_{j*(t)}}(t) = \min\{\max(\mu^D_{C_1}(t), \mu^D_{c_1}(t)), \ldots, \max(\mu^D_{C_k}(t), \mu^D_{c_k}(t))\}    (12)

F_{j*(t)}(t) is determined from the right-hand side of (12), as:

F_{j*(t)}(t) = \arg\max(\mu^D_{C_1}(t), \mu^D_{c_1}(t)) \wedge \cdots \wedge \arg\max(\mu^D_{C_k}(t), \mu^D_{c_k}(t)) \triangleq A^{j*}_1(t) \wedge \cdots \wedge A^{j*}_k(t)    (13)

In (13), arg max(μ^D_{C_i}(t), μ^D_{c_i}(t)) denotes the winner of max(μ^D_{C_i}(t), μ^D_{c_i}(t)), namely C_i or c_i. For completeness, a proof of this theorem is given in Appendix A, because this theorem is the basis for the following FCCM, which is a very efficient (fast) method for computing (6)-(10):^9

(1) Compute F_{j*(t)}(t) using (13).
(2) Find the J uniquely different F_{j*(t)}(t) and re-label them F_{j'} (j' = 1, . . . , J).
(3) Compute t_{F_{j'}}, where (t = 1, . . . , N; j' = 1, . . . , J)

t_{F_{j'}}(t) = \begin{cases} 1 & \text{if } F_{j'} = F_{j*(t)}(t) \\ 0 & \text{otherwise} \end{cases}    (14)

(4) Compute N_{F_{j'}} (j' = 1, . . . , J), where

N_{F_{j'}} = \sum_{t=1}^{N} t_{F_{j'}}(t)    (15)

(5) Establish the R_S surviving causal combinations F^S_l (l = 1, . . . , R_S), as (j' = 1, . . . , J):

F^S_l = \begin{cases} F_{j'}(j' \to l) & \text{if } N_{F_{j'}} \ge f \\ \emptyset & \text{if } N_{F_{j'}} < f \end{cases}    (16)

From the structure of F_{j*(t)}(t) in the second line of (13), F^S_l in (16) can be expressed^10 as:

F^S_l(t) = A^l_1 \wedge \cdots \wedge A^l_k    (17)

where l = 1, . . . , R_S and t = 1, . . . , N.
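In code, the five FCCM steps collapse to a single vectorized pass over the derived MFs, because (13) picks C_i exactly when μ^D_{C_i}(t) > 0.5. Here is a minimal Python sketch (ours, under the same conventions as the CCM sketch above; ties at exactly 0.5 are resolved here toward c_i):

```python
import numpy as np
from collections import Counter

def fccm(mu_D, f=1):
    """FCCM Steps 1-5, Eqs. (13)-(16).

    mu_D: (k, N) array of derived MFs. Returns {pattern: N_Fj'} for the R_S
    surviving causal combinations (True/False = C_i/c_i), without ever
    enumerating the 2**k candidates.
    """
    # Step 1, Eq. (13): the winner for condition i in case t is C_i iff
    # mu_D[i, t] > 1 - mu_D[i, t], i.e., iff mu_D[i, t] > 0.5.
    winners = (mu_D > 0.5).T                  # one length-k pattern per case
    # Steps 2-4, Eqs. (14)-(15): unique patterns and their case counts.
    counts = Counter(map(tuple, winners))     # J <= N distinct patterns
    # Step 5, Eq. (16): keep the patterns with N_Fj' >= f.
    return {p: n for p, n in counts.items() if n >= f}
```

Per case this is O(k) work instead of the O(2^k · k) of the truth table; e.g., for the Concrete Slump Test of Example 3 with three terms per variable (k = 27), the 2^27 ≈ 1.3 × 10^8 candidates never need to be touched.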

Comment 4. Observe from Steps 1-5 that the explicit enumeration of all 2^k = 2^{n_cV} candidate causal combinations is not required in FCCM; it is required in CCM.

Example 2. In order to illustrate the FCCM computations, we consider^11 a simplified Auto MPG^12 application for Low MPG cars (from 14 four-cylinder automobiles^13). We selected three input variables, namely Horsepower (H), Weight (W) and Acceleration (A), and two terms (Low and High) for each variable; hence, there are six causal conditions: LH = Low Horsepower, HH = High Horsepower, LW = Light Weight, HW = Heavy Weight, LA = Low Acceleration and HA = High Acceleration. The data in Table 1 show the variables and derived MFs. How these MF values were actually obtained is explained in [17]. For six causal conditions there are 2^6 = 64 candidate causal combinations whose MFs have to be evaluated for all 14 cases. Using the min–max formulas from Theorem 1, we found the winning causal combination for each case. These results are summarized in Table 2. Observe from the last column of this table that only six uniquely different causal combinations, out of the 64 possible, have survived, namely:

lHHHlWhWLAhA, lHhHlWhWLAhA, LHhHLWhWlAHA, lHhHlWHWlAhA, lHHHlWhWlAhA, LHhHlWhWlAHA.

The membership function grades for these six surviving causal combinations for all 14 cases are summarized in Table 3; e.g., for Case 1 (the numerical values for the membership grades or their complements are taken from Table 1)

^9 FCCM is modeled after Step 6NEW in Fast fsQCA, as described in [17].
^10 Note that in FCCM j*(t) → j' → l.
^11 This example and its tables are taken from Examples 1 and 2 in [17]; it is included in this paper in order to illustrate the FCCM, because a reviewer of this paper felt an example would help the readers better grasp the ideas of FCCM.
^12 The MPG data set can be obtained at: http://archive.ics.uci.edu/ml/datasets/Auto+MPG.
^13 The numbered cases correspond to the following cars: 1 – Toyota Corona Mark II, 2 – Datsun pl510 (70), 3 – Volkswagen 1131 Deluxe Sedan, 4 – Peugeot 504, 5 – Audi 100 LS, 6 – Saab 99e, 7 – BMW 2002, 8 – Datsun pl510 (71), 9 – Chevrolet Vega 2300, 10 – Toyota Corona, 11 – Chevrolet Vega (sw), 12 – Mercury Capri 2000, 13 – Opel 1900, 14 – Plymouth Cricket. These cars all had MF values greater than zero in Low MPG cars.


Table 1
Data and fuzzy membership matrix, showing the original variables and their derived fuzzy-set MF scores (adapted from Table 1 in [17]).

Case   H    MF(LH)  MF(HH)   W     MF(LW)  MF(HW)   A     MF(LA)  MF(HA)
1      95   0       0.91     2372  0       0.08     15    0.92    0
2      88   0       0.12     2130  0.31    0        14.5  1       0
3      46   1       0        1835  1       0        20.5  0       1
4      87   0       0.06     2672  0       0.97     17.5  0       0.06
5      90   0       0.33     2430  0       0.22     14.5  1       0
6      95   0       0.91     2375  0       0.08     17.5  0       0.06
7      113  0       1        2234  0       0        12.5  1       0
8      88   0       0.12     2130  0.31    0        14.5  1       0
9      90   0       0.33     2264  0       0        15.5  0.63    0
10     95   0       0.91     2228  0.01    0        14    1       0
11     72   0.75    0        2408  0       0.16     19    0       0.8
12     86   0       0.02     2220  0.02    0        14    1       0
13     90   0       0.33     2123  0.35    0        14    1       0
14     70   0.89    0        1955  0.99    0        20.5  0       1

Table 2
Min–max calculations and associated causal combinations (taken from Table 1 in [17]). Each entry shows max(MF, complement of MF) and the winner (W).

Case  Max(LH,lH)/W  Max(HH,hH)/W  Max(LW,lW)/W  Max(HW,hW)/W  Max(LA,lA)/W  Max(HA,hA)/W  Min [using (12)]  Causal combination [using (13)]
1     1/lH          0.91/HH       1/lW          0.92/hW       0.92/LA       1/hA          0.91              lHHHlWhWLAhA
2     1/lH          0.88/hH       0.69/lW       1/hW          1/LA          1/hA          0.69              lHhHlWhWLAhA
3     1/LH          1/hH          1/LW          1/hW          1/lA          1/HA          1                 LHhHLWhWlAHA
4     1/lH          0.94/hH       1/lW          0.97/HW       1/lA          0.94/hA       0.94              lHhHlWHWlAhA
5     1/lH          0.67/hH       1/lW          0.78/hW       1/LA          1/hA          0.67              lHhHlWhWLAhA
6     1/lH          0.91/HH       1/lW          0.92/hW       1/lA          0.94/hA       0.91              lHHHlWhWlAhA
7     1/lH          1/HH          1/lW          1/hW          1/LA          1/hA          1                 lHHHlWhWLAhA
8     1/lH          0.88/hH       0.69/lW       1/hW          1/LA          1/hA          0.69              lHhHlWhWLAhA
9     1/lH          0.67/hH       1/lW          1/hW          0.63/LA       1/hA          0.63              lHhHlWhWLAhA
10    1/lH          0.91/HH       0.99/lW       1/hW          1/LA          1/hA          0.91              lHHHlWhWLAhA
11    0.75/LH       1/hH          1/lW          0.84/hW       1/lA          0.8/HA        0.75              LHhHlWhWlAHA
12    1/lH          0.98/hH       0.98/lW       1/hW          1/LA          1/hA          0.98              lHhHlWhWLAhA
13    1/lH          0.67/hH       0.65/lW       1/hW          1/LA          1/hA          0.65              lHhHlWhWLAhA
14    0.89/LH       1/hH          0.99/LW       1/hW          1/lA          1/HA          0.89              LHhHLWhWlAHA

MG(lHHHlWhWLAhA | Case 1) = min(1, 0.91, 1, 0.92, 0.92, 1) = 0.91

Observe from Table 3 that for each case there is only one causal combination for which its MF is greater than 0.5; it is marked with an asterisk and illustrates the truth of the Min–Max Theorem.

Example 3. In this example we illustrate the number of surviving causal combinations for eight readily available data sets: Abalone [9], Concrete Compressive Strength [9], Concrete Slump Test [9], Wave Force [12], Chemical Process Concentration Readings [4], Chemical Process Temperature Readings [4], Gas Furnace [4] and Mackey–Glass Chaotic Time Series [6]. Our results are summarized in Table 4.

To begin, we found (one, two or three) MFs for each problem's variables using a modification of Fuzzy c-Means (FCM) [3] called Linguistically Modified Fuzzy c-Means (LM-FCM), which is described^14 in [14]. For one term per variable we ran FCM for two clusters, and, because the two MFs are complements of one another,^15,16 we chose either one of the two as the term's MF. For two terms per variable we ran FCM for three clusters and assigned Low to the left-most cluster and High to the right-most cluster. For three terms per variable, we ran FCM for five clusters and assigned Low to the left-most cluster, Moderate to the middle cluster and High to the right-most cluster. Of course, there are other ways to arrive at the MFs; and, since Theorem 1 is very dependent upon the MFs, the results in Table 4 will be different for different choices of the MFs.

^14 FCM MFs have some problems when they are used for linguistic terms, i.e., they are not completely shoulder or interior MFs. LM-FCM modifies FCM MFs so that the resulting MFs are completely shoulder or interior MFs and are therefore more linguistically plausible.
^15 In FCM [3], the sum of all MFs for each case always adds to one.
^16 Some may object to our referring to the two-cluster situation as "one term per variable" and may instead choose to refer to it as "two terms per variable," in which case there would not be a Table 4 column labeled "one term per variable." We chose not to do this because, when FCM is used to find the MFs, then, as mentioned in the text, there is a complementary relationship between the MFs of the two terms. If, however, FCM is not used to find the MFs for two terms, then such a complementary relationship between the two MFs would not (necessarily) occur.


Table 3
Membership grades for the six surviving causal combinations and 14 cases (i.e., the minimum of the six causal-condition MFs); membership grades for the six causal conditions are in Table 1 (taken from Table 1 in [17]). Each membership grade greater than 0.5 is marked with an asterisk.

Case  lHHHlWhWLAhA  lHhHlWhWLAhA  LHhHLWhWlAHA  lHhHlWHWlAhA  lHHHlWhWlAhA  LHhHlWhWlAHA
1     0.91*         0.09          0             0.08          0.08          0
2     0.12          0.69*         0             0             0             0
3     0             0             1*            0             0             0
4     0             0             0             0.94*         0.03          0
5     0.33          0.67*         0             0             0             0
6     0             0             0             0.08          0.91*         0
7     1*            0             0             0             0             0
8     0.12          0.69*         0             0             0             0
9     0.33          0.63*         0             0             0.33          0
10    0.91*         0.09          0             0             0             0
11    0             0             0             0.16          0             0.75*
12    0.02          0.98*         0             0             0             0
13    0.33          0.65*         0             0             0             0
14    0             0             0.89*         0             0             0.01
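As a quick check of Theorem 1 on these tables, the following snippet (ours) reproduces the Case 1 row of Tables 2 and 3 from the Table 1 derived MFs:

```python
# Case 1 derived MFs from Table 1, ordered as LH, HH, LW, HW, LA, HA.
labels = ["LH", "HH", "LW", "HW", "LA", "HA"]
mu = [0.0, 0.91, 0.0, 0.08, 0.92, 0.0]

# Eq. (13): per condition, keep the winner of (mu, 1 - mu); the complement
# of a condition is written with a lower-case first letter (LH -> lH).
names = [lab if m > 0.5 else lab[0].lower() + lab[1] for lab, m in zip(labels, mu)]
grades = [max(m, 1 - m) for m in mu]

print("".join(names), min(grades))   # lHHHlWhWLAhA 0.91, as in Tables 2 and 3
```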

Focusing next on the numbers in Table 4, observe that the number of surviving causal combinations always increases as the number of terms per variable increases, for all eight problems. This makes sense from an information viewpoint, because describing a variable with more terms implies that more information is being extracted from the data, and this requires more causal combinations to do it. Observe, also, that:

(1) When there is only one term per variable and three variables (as occurs for Wave Force, Chemical Process Concentration Readings and Chemical Process Temperature Readings), then the number of surviving causal combinations is either the same number (8) or close to the same number (6) as the number of candidate possible causal combinations (8). This suggests (to us) that one can (should) extract more information from these data sets by using more than one term per variable.

(2) In all other situations the number of surviving causal combinations is considerably smaller than the number of candidate or plausibly possible causal combinations.^17 This difference increases when more terms per variable are used; e.g., using three terms per variable, the number of candidate causal combinations for the Concrete Slump Test data set is 134,217,728, whereas the number of surviving causal combinations is 97.

(3) Abalone has seven variables and 4177 cases, and only 55/128, 118/2187 and 352/16,384 surviving causal combinations occurred when one, two or three terms were used per variable; these are very significant reductions from the plausibly possible numbers of causal combinations (128, 2187 and 16,384, respectively).

(4) Concrete Compressive Strength and Concrete Slump Test have even greater reductions from the plausibly possible number of causal combinations to the number of surviving causal combinations. Concrete Slump Test is interesting because there are so few cases (103). If one wants to use three (two) terms per variable then there are 97 (90) surviving causal combinations, almost one per case, indicating (to us) that it is not a good idea to use three (two) terms per variable for this data set when so few cases are available.

(5) Forecasting the Mackey–Glass Chaotic Time Series is a very well studied problem, in which it is quite common to begin with four past samples (variables) and exactly 16 rules. This is in agreement with using two terms per variable, where the number of surviving causal combinations turns out to be 16.

Example 4. In this example we illustrate the surviving causal combinations for the Abalone [9] data set. Abalone has seven variables, namely: Length, Diameter, Height, Whole Weight, Shucked Weight, Viscera Weight, and Shell Weight. We used one MF for each variable, as described in Example 3, and let^18 H1 ≡ High Length, H2 ≡ High Diameter, H3 ≡ High Height, H4 ≡ High Whole Weight, H5 ≡ High Shucked Weight, H6 ≡ High Viscera Weight, and H7 ≡ High Shell Weight. Table 5 enumerates the 55 surviving causal combinations. Its last column indicates the number of cases associated with each causal combination. Observe that the first six causal combinations cover 3439 cases, which is more than 82% of all of the cases; so, if one uses the 80% rule that is mentioned in Comment 2, this could be done with only six causal combinations.

^17 See Comment 5, which follows Example 4, for an explanation of what is meant by a plausibly possible causal combination.
^18 You may not agree with our linguistic use of "High" for all of the variables; however, as explained in Footnote 4, the actual names given to the subsets are not important for using the results in this paper.

Table 4
Number of surviving causal combinations for eight problems.

                                            One term per variable  Two terms per variable            Three terms per variable
Problem                   Cases  Var. (V)   Cand.^a   Surviving    Cand.^b    Plaus.^c   Surviving   Cand.^d      Plaus.^e  Surviving
Abalone [9]               4177   7          128       55           16,384     2187       118         2,097,152    16,384    352
Concrete compressive
  strength [9]            1030   8          256       73           65,536     6561       218         16,777,216   65,536    439
Concrete slump test [9]   103    9          512       71           262,144    19,683     90          134,217,728  262,144   97
Wave force [12]           317    3          8         8            64         27         20          512          64        28
Chemical process
  concentration
  readings [4]            194    3          8         8            64         27         22          512          64        24
Chemical process
  temperature
  readings [4]            223    3          8         6            64         27         10          512          64        16
Gas furnace [4]           293    6          64        25           4096       729        66          262,144      4096      113
Mackey–Glass chaotic
  time series [6]         1000   4          16        8            256        81         16          4096         256       22

^a This is 2^V, because all candidate causal combinations are possible (see Comment 5, which follows Example 4).
^b This is (2^2)^V = 4^V.
^c Instead of 4 candidate causal combinations per variable, there are only 3 plausibly possible candidate causal combinations per variable; hence, the total number of plausibly possible candidate causal combinations is 3^V (see Comment 5, which follows Example 4).
^d This is (2^3)^V = 8^V.
^e Instead of 8 candidate causal combinations per variable, there are only 4 plausibly possible candidate causal combinations per variable; hence, the total number of plausibly possible candidate causal combinations is 4^V (see Comment 5, which follows Example 4).

Table 5
Surviving causal combinations, and the number of cases associated with each, for the Abalone data set. In this table, 1 (0) denotes the presence of the causal condition (of its complement).

No.  H1 H2 H3 H4 H5 H6 H7  Cases
1    0  0  0  0  0  0  0   1443
2    1  1  1  1  1  1  1   1362
3    0  0  1  1  1  1  1   229
4    0  0  0  1  1  1  1   181
5    0  0  0  0  1  0  0   124
6    0  0  0  1  1  1  0   100
7    1  1  0  1  1  1  1   75
8    0  1  1  1  1  1  1   74
9    0  0  0  0  0  1  0   64
10   0  0  0  0  0  0  1   58
11   0  0  0  0  1  1  0   52
12   1  0  1  1  1  1  1   33
13   0  0  0  0  0  1  1   33
14   0  0  1  0  0  0  0   31
15   0  0  0  1  0  1  1   30
16   0  0  1  1  1  1  0   28
17   0  0  0  1  1  0  1   28
18   0  1  0  1  1  1  1   25
19   0  0  1  1  0  1  1   24
20   0  0  1  1  1  0  1   22
21   0  0  1  0  0  1  0   15
22   0  0  1  0  1  0  0   13
23   0  0  1  0  0  0  1   13
24   1  0  0  1  1  1  1   12
25   0  0  0  1  0  0  1   11
26   1  0  0  1  1  1  0   10
27   0  0  0  1  1  0  0   9
28   0  0  0  0  1  0  1   9
29   0  0  1  0  1  1  0   8
30   0  0  1  0  0  1  1   8
31   0  1  0  0  0  0  0   8
32   0  0  1  1  1  0  0   6
33   0  0  0  1  0  1  0   5
34   0  0  1  1  0  0  1   4
35   1  1  0  0  0  0  0   4
36   1  1  1  1  1  1  0   3
37   1  1  1  1  0  1  1   3
38   0  1  0  1  1  0  1   2
39   0  0  1  0  1  0  1   2
40   1  1  1  1  1  0  1   1
41   0  0  0  0  1  1  1   1
42   0  1  0  0  1  1  0   1
43   1  0  0  1  1  0  1   1
44   1  1  0  1  1  1  0   1
45   0  0  1  0  1  1  1   1
46   0  1  1  1  1  0  1   1
47   1  1  0  0  0  0  1   1
48   1  1  0  0  1  1  0   1
49   1  1  0  0  1  0  0   1
50   0  1  1  1  1  1  0   1
51   1  1  1  0  0  0  1   1
52   0  1  0  1  0  1  1   1
53   0  1  1  1  0  1  1   1
54   1  0  1  1  0  1  1   1
55   0  1  0  1  1  1  0   1
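The more-than-82% coverage of the first six causal combinations quoted in Example 4 can be checked directly from the last column of Table 5 (our snippet):

```python
# Case counts from the last column of Table 5 (rows 40-55 all have one case).
cases = [1443, 1362, 229, 181, 124, 100, 75, 74, 64, 58, 52, 33, 33, 31, 30,
         28, 28, 25, 24, 22, 15, 13, 13, 12, 11, 10, 9, 9, 8, 8, 8, 6, 5, 4,
         4, 3, 3, 2, 2] + [1] * 16
print(sum(cases[:6]) / sum(cases))   # 3439 / 4177 = 0.823...
```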

Comment 5. Mendel and Korjani [17] proved that if each of the V variables is described by only one term (n_c = 1), then all of the 2^k = 2^V candidate causal combinations can occur, i.e., all are said to be possible, meaning that there may be actual cases for which every one of the 2^k candidate causal combinations has^19 a MF > 0.5. However, if one or more of the variables is described by more than one term, then not all of the 2^k = 2^{n_cV} candidate causal combinations are possible, meaning that some of the 2^k candidate causal combinations can never have a MF > 0.5; this does not depend on the cases (i.e., on the data), but instead derives from the mathematics that is associated with

\mu_{F_{j*(t)}}(t) = \min\{\max(\mu^D_{C_1}(t), \mu^D_{c_1}(t)), \ldots, \max(\mu^D_{C_k}(t), \mu^D_{c_k}(t))\} > 0.5.

For example, if a variable is described by two terms, High (H) and Low (L), then of the four possible combinations LH, lH, Lh and lh, when the MF of L is sufficiently to the left of the MF of H (as usually occurs) it is theoretically impossible for LH to occur, because MF(LH) will always be ≤0.5; hence, only three of the four combinations are plausibly possible for that variable, which is the source of the 3^V and 4^V entries in Table 4.

^19 To prove this [17], sketch the MFs for, e.g., Low (L) and Not Low (l). Let ξ_1 denote the point at which μ_L(ξ) = μ_l(ξ). It will be obvious that in the Min–Max Theorem arg max(μ^D_L(ξ), μ^D_l(ξ)) = L if ξ(t) ≤ ξ_1, and arg max(μ^D_L(ξ), μ^D_l(ξ)) = l if ξ(t) > ξ_1. Since both L and l may occur (regardless of actual case data), all 2^k postulated causal combinations F_j (j = 1, . . . , 2^k) are said to be possible when only one term is assigned to each variable.

5. Additional ways to speed up FCCM

Because Step 1 of FCCM operates on each case independently, the computation of the winning causal combinations can be distributed, as:

F_{j*(t)}(t \mid N_1) = \arg\max(\mu^D_{C_1}(t), \mu^D_{c_1}(t)) \wedge \cdots \wedge \arg\max(\mu^D_{C_k}(t), \mu^D_{c_k}(t)),\quad t = 1, \ldots, N_1
F_{j*(t)}(t \mid N_2 - N_1) = \arg\max(\mu^D_{C_1}(t), \mu^D_{c_1}(t)) \wedge \cdots \wedge \arg\max(\mu^D_{C_k}(t), \mu^D_{c_k}(t)),\quad t = N_1 + 1, \ldots, N_2
  \vdots
F_{j*(t)}(t \mid N - N_D) = \arg\max(\mu^D_{C_1}(t), \mu^D_{c_1}(t)) \wedge \cdots \wedge \arg\max(\mu^D_{C_k}(t), \mu^D_{c_k}(t)),\quad t = N_D + 1, \ldots, N    (22)

where the set of N cases is distributed into D subsets. The final surviving causal combinations can then be obtained by combining the results of the distributed computations, as

F_{j*(t)}(t) = F_{j*(t)}(t \mid N_1) \vee F_{j*(t)}(t \mid N_2 - N_1) \vee \cdots \vee F_{j*(t)}(t \mid N - N_D)    (23)
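Here is a minimal sketch of (22)-(23) using Python's multiprocessing as the distribution mechanism (our illustration; the chunking scheme and worker count are assumptions, and the same pattern maps directly onto any map-reduce framework): each worker computes the winning patterns for its slice of the cases, and the per-slice counts are merged before thresholding with f.

```python
import numpy as np
from collections import Counter
from multiprocessing import Pool

def winners_for_chunk(mu_D_chunk):
    """Eq. (22): winning causal-combination pattern for every case in one chunk."""
    return Counter(map(tuple, (mu_D_chunk > 0.5).T))

def distributed_fccm(mu_D, n_workers=4, f=1):
    """Eq. (23): merge the per-chunk results into the surviving combinations."""
    chunks = np.array_split(mu_D, n_workers, axis=1)   # split the N cases into subsets
    with Pool(n_workers) as pool:
        partial_counts = pool.map(winners_for_chunk, chunks)
    total = Counter()
    for c in partial_counts:
        total.update(c)                                # combine, as in Eq. (23)
    return {p: n for p, n in total.items() if n >= f}

if __name__ == "__main__":   # guard required by multiprocessing on some platforms
    rng = np.random.default_rng(0)
    print(distributed_fccm(rng.random((6, 1000)), f=10))
```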

6. Conclusions

This paper has presented a very efficient method for establishing the initial (nonlinear) combinations of variables that can then be used in later modeling and processing (e.g., regression, classification, neural networks, etc.), by using a novel form of preprocessing that transforms raw data into patterns through the use of fuzzy sets. Our method lends itself to massive distributed and parallel processing, which makes it suitable for data of all sizes, from small to big. Variables are first partitioned into subsets, each of which has a linguistic term (called a causal condition) associated with it. Our Causal Combination Method uses fuzzy sets to model the terms and focuses on interconnections (causal combinations) of either a causal condition or its complement, where the connecting word is AND, which is modeled using the minimum operation. Our Fast Causal Combination Method is based on a novel theoretical result, leads to an exponential speedup in computation and lends itself to parallel and distributed processing; hence, it can be used on data from small to big.

Although this paper has focused only on establishing nonlinear combinations of variables from data, a reader may be wondering what one can do with the final FCCM surviving causal combinations. It is important to realize that FCCM is not limited to Big Data; it can also be applied to moderate and small data sets. For small to moderate data sets, one can use the surviving causal combinations to perform regression, classification and fsQCA. There is a lot of research going on worldwide to figure out how to do regression and classification for Big Data. In [15], Korjani and Mendel have shown how the surviving causal combinations can be used in a new regression model, called Variable Structure Regression (VSR). Using the FCCM surviving causal combinations, one can simultaneously determine the number of terms in the (nonlinear) regression model as well as the exact mathematical structure of each of the terms (basis functions). VSR has been tested on the eight classical (and readily available) small to moderate size data sets that are stated in Table 4 (four are for multi-variable function approximation and four are for forecasting), has been compared against five other methods, and has ranked #1 against all of them for all of the eight data sets. Surviving causal combinations have also been used to obtain linguistic summarizations using Fuzzy Set Qualitative Comparative Analysis (fsQCA) [14,16,17].

How to extend the results in this paper to type-2 fuzzy sets is under study.


Appendix A. Proofs

A.1. Proof of Theorem 1 (taken from [17, Appendix A.1])

When F_j = A^j_1 ∧ A^j_2 ∧ · · · ∧ A^j_k, where A^j_i = C_i or c_i, then μ_{F_j}(t) is given by (11), where μ_{A^j_i}(t) = μ^D_{C_i}(t) or μ^D_{c_i}(t), or, equivalently,

\mu_{A^j_i}(t) = \min\{\mu^D_{C_i}(t), \mu^D_{c_i}(t)\} \quad \text{or} \quad \max\{\mu^D_{C_i}(t), \mu^D_{c_i}(t)\}    (A.1)

where

\max\{\mu^D_{C_i}(t), \mu^D_{c_i}(t)\} > 0.5, \qquad \min\{\mu^D_{C_i}(t), \mu^D_{c_i}(t)\} \le 0.5    (A.2)

Consequently, only if μ_{F_{j*(t)}}(t) is given by (12) can (11) have a MF value that is >0.5. Observe that (13) is an immediate consequence of (12). □

A.2. Proof of Corollary 1-1 (taken from [17, Appendix A.2])

It is easy to prove (20) by using (13) and mathematical induction. Here the proof is illustrated for k = 3. From (13), it follows that:

F_{j*(t)}(t \mid C_1, C_2) = \arg\max(\mu^D_{C_1}(t), \mu^D_{c_1}(t)) \wedge \arg\max(\mu^D_{C_2}(t), \mu^D_{c_2}(t))    (A.3)

F_{j*(t)}(t \mid C_1, C_2, C_3) = \arg\max(\mu^D_{C_1}(t), \mu^D_{c_1}(t)) \wedge \arg\max(\mu^D_{C_2}(t), \mu^D_{c_2}(t)) \wedge \arg\max(\mu^D_{C_3}(t), \mu^D_{c_3}(t))    (A.4)

Comparing (A.3) and (A.4), it is easy to see that:

F_{j*(t)}(t \mid C_1, C_2, C_3) = F_{j*(t)}(t \mid C_1, C_2) \wedge \arg\max(\mu^D_{C_3}(t), \mu^D_{c_3}(t))    (A.5)

which is (20). □

References

[1] A. Bargiela, W. Pedrycz, Granular Computing: An Introduction, Springer, New York, 2003.
[2] N. Bharill, A. Tiwari, Handling big data with fuzzy based classification approach, in: M. Jamshidi, V. Kreinovich, J. Kacprzyk (Eds.), Advance Trends in Soft Computing: Proceedings of WCSC 2013, December 16–18, San Antonio, TX, USA, Springer, New York, 2014.
[3] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer, Norwell, MA, 1981.
[4] G.E.P. Box, G.M. Jenkins, G.C. Reinsel, Time Series Analysis: Forecasting and Control, third ed., Prentice-Hall, Englewood Cliffs, NJ, 1994.
[5] V. Cherkassky, F. Mulier, Learning From Data: Concepts, Theory and Methods, John Wiley & Sons, Inc., New York, 1998.
[6] R.S. Crowder, Predicting the Mackey–Glass time series with cascade correlation learning, in: Proc. Connectionist Models Summer School, 1990, pp. 117–123.
[7] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., John Wiley & Sons, Inc., New York, 2001.
[8] P.C. Fiss, Building better causal theories: a fuzzy set approach to typologies in organization research, Acad. Manage. J. 54 (2011) 393–420.
[9] A. Frank, A. Asuncion, UCI Machine Learning Repository, 2010. <http://archive.ics.uci.edu/ml>.
[10] T.C. Havens, J.C. Bezdek, C. Leckie, L.O. Hall, M. Palaniswami, Fuzzy c-means algorithms for very large data, IEEE Trans. Fuzzy Syst. 20 (Dec.) (2012) 1130–1146.
[11] S. Haykin, Neural Networks and Learning Machines, third ed., Prentice-Hall, Upper Saddle River, NJ, 2009.
[12] R.J. Hyndman, Time Series Data Library, September 2010.
[13] S. Kotsiantis, D. Kanellopoulos, P. Pintelas, Data preprocessing for supervised learning, Int. J. Comput. Sci. 1 (2) (2006) 111–117.
[14] M.M. Korjani, J.M. Mendel, Fuzzy set qualitative comparative analysis (fsQCA): challenges and applications, in: Presented at 2012 NAFIPS Conf., Berkeley, CA, August 2012.
[15] M.M. Korjani, J.M. Mendel, Non-linear variable structure regression (VSR) and its application in time-series forecasting, in: Proc. of FUZZ-IEEE, Beijing, China, July 2014.
[16] J.M. Mendel, M.M. Korjani, Charles Ragin's fuzzy set qualitative comparative analysis (fsQCA) used for linguistic summarizations, Inform. Sci. 202 (2012) 1–23.
[17] J.M. Mendel, M.M. Korjani, Theoretical aspects of fuzzy set qualitative comparative analysis (fsQCA), Inform. Sci. 237 (2013) 137–161.
[18] J.M. Mendel, D. Wu, Perceptual Computing: Aiding People in Making Subjective Judgments, Wiley and IEEE Press, Hoboken, NJ, 2010.
[19] C.C. Ragin, Redesigning Social Inquiry: Fuzzy Sets and Beyond, Univ. of Chicago Press, Chicago, IL, 2008.
[20] B. Rihoux, C.C. Ragin (Eds.), Configurational Comparative Methods: Qualitative Comparative Analysis (QCA) and Related Techniques, SAGE, Los Angeles, CA, 2009.
[21] C. Ritz, J.C. Streibig, Nonlinear Regression With R, Springer, New York, 2008.
[22] A. Sen, M. Srivastava, Regression Analysis: Theory, Methods, and Applications, Springer-Verlag, New York, 1990 (Chapter 9: Transformations).
[23] S.S. Viglione, Applications of pattern recognition technology, in: J.M. Mendel (Ed.), A Prelude to Neural Networks: Adaptive and Learning Systems, Prentice-Hall, Englewood Cliffs, NJ, 1994, pp. 115–162.
[24] Data preprocessing, <http://www.techopedia.com/definition/14650/data-preprocessing> (accessed 11.11.13).