Proceedings of the 33rd Hawaii International Conference on System Sciences - 2000
An Integrated Data Mining System to Automate Discovery of Measures of Association

Cecil Chua
Information Management Research Centre, School of Accountancy and Business, Nanyang Technological University, Singapore 639798
[email protected]

Roger H.L. Chiang
College of Business Administration, University of Cincinnati, Cincinnati, Ohio 45221, USA
[email protected]

Ee-Peng Lim
Center for Advanced Information Systems, School of Applied Science, Nanyang Technological University, Singapore 639798
[email protected]

Abstract

Many data analysts require tools which integrate their database management packages (e.g. Microsoft Access) with their data analysis packages (e.g. SAS, SPSS), and which provide guidance in selecting appropriate mining algorithms. In addition, analysts need to extract and validate statistical results to facilitate data mining. In this paper, we describe an integrated data mining system called the Linear Correlation Discovery System (LCDS) that meets these requirements. LCDS consists of four major subcomponents, two of which, the selection assistant and the statistics coupler, are discussed in this paper. The former examines the schema and instances to determine appropriate association measurement functions (e.g. chi-square, linear regression, ANOVA). The latter invokes the appropriate statistical function on a sample data set and extracts relevant statistical output, such as η² and R², for effective mining of data. We also describe a new validation algorithm based on measuring the consistency of mining results across multiple test sets.
1 Introduction
Due to budget constraints on IT spending, many organizations are unable or unwilling to purchase expensive data mining systems, or hire data mining consultants. In these organizations, data analysts face the following problems.
• Inadequacy of off-the-shelf tools: Analysts often have no integrated data management/data mining/data analysis tool; only separate off-the-shelf tools are available to them.

• Analysts' time is expensive: The analysts have other, more pressing duties than performing data mining. They want to spend a minimum amount of time coding, integrating, or analyzing the data, but want meaningful, comprehensible results from the mining.

• Analysts are not experts: Analysts have little formal training in data mining, data analysis, or data management. Trained analysts continue to be in short supply, especially since the growth in data to be analyzed continues to outpace the number of new trained data analysts entering the market [18].

To overcome the above problems, an intelligent data mining integrator that facilitates the integration of off-the-shelf tools and DBMSs is needed. We therefore propose a data mining integrator as shown in Figure 1. The components of the integrator include:

• Statistical interface: The most popular tools for data analysis are statistical packages. The data mining integrator must be able to utilize available statistical packages so that advanced statistical functions such as logit analysis [28] and canonical correlation [23] can be applied to appropriate data sets during the data mining process.
0-7695-0493-0/00 $10.00 (c) 2000 IEEE
Figure 1. Components of an intelligent data mining integrator

• Data mining interface: The data mining integrator is responsible for communicating with the different data mining tools. Since most data mining packages do not share a common and accepted interface, customized data mining interfaces have to be built for each package to be integrated.

• DBMS interface: The data mining integrator must be able to communicate with the available DBMSs. Since the issue of accessing heterogeneous DBMSs has been addressed by standards such as ODBC, RDA, and SQL, we do not address it here.

• Intelligent assistant: The integrator must be able to guide the analyst in selecting the appropriate algorithm for mining data, and in extracting relevant results. This functionality can be provided by a selection assistant and a statistics or data mining coupler.

This paper focuses on the development of the selection assistant and statistics coupler for a prototype data mining system called the Linear Correlation Discovery System (LCDS). LCDS is used to guide novice analysts in knowledge discovery. This paper also proposes a new method to validate associations discovered in large databases. Section 2 discusses the application of measuring associations for data mining. Section 3 discusses the prototype system development issues. Section 4 describes the selection assistant. Section 5 describes the statistics coupler. Section 6 discusses our validation method. Finally, Section 7 concludes the work.

2 Applications of measures of association

Measures of association such as chi-square and R² are statistical measures used to evaluate the similarity between data sets. While measures of association are useful for data mining (e.g. see [5, 27, 30]), it is often difficult to integrate association measurement functions in automated (i.e. unsupervised or minimally supervised) data mining systems for three reasons: 1) the multiplicity of measures, 2) the lack of validity measures for data mining, and 3) the lack of a system interface.

2.1 Multiplicity of measures

Since there are so many association measurement functions (e.g. chi-square, linear regression, logistic regression, logit, probit, ANOVA), the selection and application of these functions poses several problems, including:

• Measurement Function Selection: Each association measurement function is used only under very narrowly defined conditions. For example, chi-square is only applicable when attributes are on the nominal measurement scale. The analyst must be able to select the appropriate association measurement function for each condition, and must be conversant with its use and the interpretation of the function's results.

• Reconciliation of heterogeneous results: The various measures of association evaluate different aspects of the association between data sets, and have different scales. For example, the coefficient of determination (linear regression) measures the strength of the association and ranges from 0 to 1, while chi-square measures the existence of an association and ranges from 0 to infinity [22].

2.2 Validity

The measures of association were developed in the context of the hypothesis-test approach of statistics. In this approach, an analyst makes a testable statement about the data
being analyzed. The measure of association is then used to test the extent to which the data supports the statement. The approach of data mining, however, is more exploratory in nature [16]. Instead of trying to confirm a hypothesis, the analyst is responsible for discovering interesting patterns in the data, i.e. generating hypotheses. Few satisfactory methods for validating data mining results have been proposed. Validation via statistical significance is not effective when statistical functions are applied in a data mining context [14]. Traditional methods for controlling statistical significance levels (e.g. Bonferroni, Tukey-w, Dunn, Dunn-Šidák) [33, 36] fail to identify many interesting associations due to low statistical power. More modern validation methods such as the Bootstrap, Jackknife, or Cross-Validation ([13, 31]) are computer intensive, and are not appropriate for the analysis of large databases.

2.3 System interface

Most database management and data mining packages do not incorporate functions to calculate measures of association such as R² and χ². Thus, the data analyst must often integrate the DBMS with a statistical package. Many off-the-shelf statistical packages are designed for manual operation. As a result, data analysts often face the following problems:

• Poor communication facilities: Most DBMS and statistical packages only have primitive facilities for communicating with each other. Often, only straightforward import or export of data (e.g. through ODBC) between packages is supported.

• Language incompatibility: Different languages are required for communicating with a DBMS or statistical package. For example, the statistical packages SPSS and SAS have their own languages for data analysis. The analyst must know both the data manipulation and data analysis languages to perform mining.

• Irrelevant statistical output: Statistical software often generates information that is not relevant for the current data mining task. For example, a statistical package such as SPSS will generate at least a page of information, even though only one statistical value (often the statistical significance score) is considered for the analysis.

• Incomplete information: Sometimes the information required by the data mining system is not found in the statistical output, but can be derived from it. For example, the logit and probit association measurement functions of most statistical packages often do not produce R². However, it can be derived using the formula R² = (G²(X₀) − G²(X)) / G²(X₀), where G²(Y) is the likelihood ratio test statistic for model Y, X is the best-fit model derived, and X₀ is the best-guess model without the independent variable(s) [6, 28].

• Output representation: The output of most statistical packages is semi-structured, and varies depending on the input data. The analyst must know the exact layout and format of the output of the statistical package to get the required information.

As a result of these problems, measuring and validating associations of data in databases is often a tedious and difficult task with great potential for error.

3 The linear correlation discovery system

We have developed a prototype integrated data mining system called the Linear Correlation Discovery System to measure associations between data sets (i.e. variables). To maintain compatibility among the results of association measurement functions (i.e. to solve the reconciliation of results), LCDS only measures the strength and the statistical significance of the association. In this paper, strength is measured using the proportion of explained variance of the association (i.e. correlation). Examples of measures of association used in LCDS include R² (regression), η² (ANOVA) and Goodman and Kruskal's λ (chi-square). Analysts only need to understand two statistical concepts (i.e. strength and significance) to comprehend results generated by LCDS. LCDS has been implemented in Visual Basic using Microsoft Access 97 and SPSS v. 8.0. It runs on any PC supporting Windows 95/NT. The relational data model is adopted as the structure of the data for mining.

Figure 2. Process flow of LCDS
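Because both of LCDS's strength scores estimate the proportion of explained variance, they are directly comparable even though they come from different statistical tests. The following sketch (illustrative only, not LCDS code) computes R² for a simple regression and η² for a one-way ANOVA on toy data:

```python
# Hypothetical sketch: both "strength" scores lie in [0, 1] and
# estimate the proportion of variance explained.

def r_squared(xs, ys):
    """R^2 of a simple linear regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

def eta_squared(groups):
    """eta^2 from a one-way ANOVA: between-group SS / total SS."""
    all_ys = [y for g in groups for y in g]
    grand = sum(all_ys) / len(all_ys)
    between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups)
    total = sum((y - grand) ** 2 for y in all_ys)
    return between / total

print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear -> 1.0
print(eta_squared([[1.0, 1.0], [5.0, 5.0]]))   # fully separated groups -> 1.0
```

A score of 1.0 on either measure means the variance is perfectly explained, which is what allows LCDS to report a single "strength" concept across functions.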
An analyst who uses LCDS is expected to follow the task sequence shown in Figure 2. The figure also indicates the LCDS components involved at every step in the sequence. Some of the steps involve multiple components. The process of LCDS is as follows:
1. Import relation: A relation is imported into LCDS using the Microsoft Access import routines. The schema information (e.g. data type, attribute name, key attributes) is also retrieved from the data dictionary of the imported relation.

2. Generate random sample: For large relations, a random sample of instances from the relation to be examined is extracted. This reduces processing time without unduly compromising the quality of discovered results [21, 24, 25, 35].

3. Derive properties of attributes and groups: The schema and instances of the attributes and sets of attributes are examined to derive properties (e.g. distinctness, order, distance) of the attributes.

4. Find correlation for one pair: The selection assistant selects the appropriate association measurement function for a user-specified pair of attributes (or sets of attributes) based on the attributes' properties. The statistics coupler executes the function and extracts the strength and the significance of the association from the statistical output. If the strength and significance scores exceed user-determined threshold values, then all attribute groups which have that pair as a subset are flagged as non-analyzable. The strong association between the pair would distort any analyses involving an attribute group that is a superset of that pair. For example, if the association between {Salary} and {Age} were strong and significant, LCDS would not measure the correlation of {Salary, Age} with other attribute groups, as {Salary, Age} exhibits multicollinearity.

5. Find correlation for all pairs: This step is similar to Step 4. The system finds the strength and the significance of the association for all attribute group pairs, beginning with the pair(s) that have the fewest attributes (i.e. the association of pairs with a total of 2 attributes is measured first). The supersets of any pair that has a strong and significant association are flagged as non-analyzable.

6. Resample results: Multiple samples are extracted from the original relation. These test sets are used to validate discovered associations.

7. Validate results using new samples: The associations of all pairs whose scores exceed threshold values are recalculated on the test sets using Fisher's combined test [38], a meta-analytical test that can draw conclusions from multiple studies. The scores of the new samples can be used to verify that the associations consistently exceed threshold values. Pairs with associations that exceed threshold values are reported to the data analyst.

4 The selection assistant

The selection assistant has three components: 1) the attribute classifier, 2) the attribute group classifier, and 3) the measurement function selector. The selection process is presented in Figure 3.

4.1 The attribute classifier

The attribute classifier assigns domain classes to attributes based on properties derived from schema information and instances. It consists of 1) a classification scheme, 2) a static command library, 3) a rule base, 4) a historical record, and 5) a rule evaluator which integrates and operates these components to classify attributes. We describe the design of each of these components below.

Classification scheme: The classification scheme describes the domain classes that each attribute can be assigned. The analyst can specify any classification scheme for attributes so long as it is based on properties derived from schema information and instances. For example, any of the classification schemes found in [7, 20, 32] can be specified. The default scheme [7] classifies attributes and groups as NUMBER (e.g. Salary), DATE (e.g. Date of Birth), CATEGORICAL (e.g. Occupation), ORDINAL (e.g. Letter Grade), or DICHOTOMOUS (e.g. Sex).

Static command library: The static command library provides functions to derive properties of the attributes. For example, consider the following rule.

IF the maximum length of the attribute is less than or equal to the user-defined constant 'LENGTH' and the maximum number of distinct instances is less than or equal to the user-defined constant 'INSTANCE'
THEN the attribute can not be NUMBER or DATE

To implement this rule, the static command library provides the examination functions to count the number of distinct instances, and measure the length of each instance.
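The two examination functions that this rule relies on can be sketched as follows. The function names and the constants LENGTH and INSTANCE are hypothetical stand-ins for the user-defined values mentioned above, not actual LCDS identifiers:

```python
# Hypothetical sketch of two examination functions from the static
# command library, plus the rule's precedent built on top of them.

LENGTH = 2      # assumed user-defined maximum instance length
INSTANCE = 10   # assumed user-defined maximum distinct-instance count

def max_length(values):
    """Length of the longest instance, treating each value as text."""
    return max(len(str(v)) for v in values)

def distinct_count(values):
    """Number of distinct instances of the attribute."""
    return len(set(values))

def rule_precedent_holds(values):
    """True when the precedent is satisfied, i.e. the attribute
    can then be eliminated as NUMBER or DATE."""
    return max_length(values) <= LENGTH and distinct_count(values) <= INSTANCE

print(rule_precedent_holds(["A", "B", "A", "C"]))          # short codes -> True
print(rule_precedent_holds(["1997-04-01", "1998-05-02"]))  # long values -> False
```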
Figure 3. Selection assistant process

Rule base: Each attribute may be assigned at most one domain class from the classification scheme. The rules for determining this domain class are found in the rule base. All rules are formatted as follows:

IF precedent THEN P(A)' = P(A) − P(precedent)

where P(X) refers to the set of possible domain classes associated with X, A refers to the attribute under examination, and precedent is a condition that evaluates to true or false. Some examples of rules in the rule base include:

IF the data type of the attribute is of type 'Date'
THEN the attribute can not be NUMBER, DICHOTOMOUS, ORDINAL, or CATEGORICAL

IF the maximum length of the attribute is less than or equal to the user-defined constant 'LENGTH' and the maximum number of distinct instances is less than or equal to the user-defined constant 'INSTANCE'
THEN the attribute can not be NUMBER or DATE

IF the data type of the attribute is of type 'String' and at least one instance contains the letters 'A' . . . 'Z' or 'a' . . . 'z' and at least one instance has a length different from the length of all other instances
THEN the attribute can not be NUMBER or DATE

IF the attribute is a candidate key
THEN the attribute cannot have the domain classes {NUMBER, DATE, ORDINAL, CATEGORICAL, DICHOTOMOUS}, i.e. it cannot have a domain class

Each rule has the following parameters:

• Rule ID: This is a unique identifier for each rule.

• Priority: This determines which rule is examined first. Priority is established based on the likelihood of the rule correctly classifying an attribute.

• Precedent: The precedent of each rule. This must evaluate to true or false.

• Antecedent: This stores all information about P(precedent). For example, in:

IF the data type of the attribute is of type 'Date'
THEN the attribute can not be NUMBER, DICHOTOMOUS, ORDINAL, or CATEGORICAL

{NUMBER, DICHOTOMOUS, ORDINAL, CATEGORICAL} would be the antecedent.

• Required information: Defines the kinds of attribute properties used by the rule.

• Terminal rule indicator: Identifies whether the rule can be used to determine the domain class of the attribute. For example, consider the rule:

IF an attribute has the Date data type
THEN the attribute cannot have the domain classes {NUMBER, ORDINAL, DICHOTOMOUS, CATEGORICAL}

The Terminal Rule Indicator of this rule is set to true, because the rule conclusively determines an attribute with the DATE domain class.
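One plausible way to represent a rule and its parameters is a small record type. The field names below mirror the parameter list above; the structure itself is a hypothetical sketch, not LCDS internals:

```python
# Hypothetical sketch of a rule record; fields mirror the parameters
# described above (Rule ID, Priority, Precedent, Antecedent,
# Required information, Terminal rule indicator).
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass
class Rule:
    rule_id: str                        # unique identifier
    priority: int                       # lower value = examined first
    precedent: Callable[[dict], bool]   # condition over attribute properties
    antecedent: FrozenSet[str]          # domain classes to eliminate
    required_info: FrozenSet[str]       # properties the precedent needs
    terminal: bool                      # may conclusively assign a class

# The Date rule from the example above:
date_rule = Rule(
    rule_id="R1",
    priority=1,
    precedent=lambda props: props["data_type"] == "Date",
    antecedent=frozenset({"NUMBER", "ORDINAL", "DICHOTOMOUS", "CATEGORICAL"}),
    required_info=frozenset({"data_type"}),
    terminal=True,
)
print(date_rule.precedent({"data_type": "Date"}))  # True
```

Note that DATE itself is absent from the antecedent, so applying the rule leaves DATE as the sole surviving class, which is why the rule can be terminal.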
Historical record: Every time the precedent of a rule is evaluated as true, the rule ID and the attribute being evaluated are stored in the historical record. Analysts can later use this information to confirm the domain class the assistant assigned to an attribute.

Rule evaluator: The rule evaluator accesses the rule base and classification scheme, and uses the properties derived by the command library to classify attributes. Classification of attributes is done through the process of elimination. First, each attribute is assigned all possible domain classes (i.e. P(A) = U, where U is the universal set of all possible domain classes). The rules are evaluated starting from the rule with the highest priority (i.e. the one most likely to classify attributes correctly). When evaluating each rule, the system first checks whether the properties required to process the rule can be derived, by checking the Required Information parameter. For example, rules which require key information are not evaluated if key information is not available. If the precedent is true, then the domain classes specified in the antecedent (i.e. P(precedent)) are removed from the list of possible domain classes of that attribute. The assistant then records in the historical record that the rule was involved. Finally, if a rule's Terminal Rule Indicator is set to true, then the remaining possible domain classes for that attribute are counted. If only one domain class remains, this domain class is assigned to the attribute. Otherwise, the rule with the next priority is applied. If all rules have been evaluated and the attribute still has more than one possible domain class, the attribute is assigned UNKNOWN. Attributes that have NO CLASS or an UNKNOWN domain class will not be used in further processing, as they will not provide relevant and interesting patterns to be mined. For example, there is no need to measure the association between Employee Number (a candidate key) and Salary, as any association discovered will be meaningless. Exceptions to this rule (e.g. Employee Number is assigned sequentially based on Date of Employment and can serve as a proxy) are handled by making these exceptions special domain classes, and including special rules for them in the rule base.

4.2 Classification of attribute groups

The assignment of domain classes to an attribute group is determined by the domain classes of its attributes. If at least one of the attributes in the attribute group is assigned either NO CLASS or UNKNOWN, the attribute group will be assigned NO CLASS. Since the selection of association measurement functions is not dependent on the ordering of attributes within a group, classification of attribute groups is always an associative process [7]. For example, the attributes Salary, Age, and Weight are all assigned the NUMBER domain class. The attribute group classifier assigns {Salary, Age, Weight} the NUMBER domain class because the domain classes of all attributes are NUMBER. The attribute group {Salary, Age, Weight} would always be assigned the same domain class as {Age, Salary, Weight}.

4.3 Selection of association measurement functions

The appropriate association measurement function is then determined using the domain classes of the pair of attribute groups to be analyzed. For example, if the association between Sex and Is Married is measured, the default association measurement function is the phi coefficient, since both Sex and Is Married are DICHOTOMOUS. If the association between Salary and Sex is measured, a logistic regression is the default association measurement function, since Salary is a NUMBER. The analyst can change the default measurement function to better fit the current analysis context.

4.4 Selection assistant performance

We have tested our selection assistant against several publicly available data sets, including [2, 3, 8, 11, 12, 15, 17, 19, 26]. The majority of these data sets are available from the CMU and UCI data repositories [4, 9]. Of a total of 119 attributes, 112 were identified correctly. Table 1 summarizes the results of our experiments: % Attr ✓ indicates the percentage of attributes correctly classified, and Incorrect Classification indicates which attributes were incorrectly classified. Our rules are 93.6% accurate with a standard deviation of 0.075. Thus, (assuming these data sets are representative) our heuristics should be at least 78% accurate for 95% of all data sets. Half of the incorrect classifications occurred in attributes that should have been classified as NUMBER but were classified as ORDINAL. These attributes have the following properties:

• The data values are represented by integers.

• They have few distinct instances.

• They are predominantly used to 'count' something (e.g. years of Education, No of Children). The sole exception was the attribute Time to accelerate.

We are currently incorporating new rules into our rule base to handle attributes with these kinds of properties.
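The elimination procedure of the rule evaluator (Section 4.1) can be sketched as follows. The rule tuples and property names are hypothetical; in particular, the real system would also skip rules whose required properties are unavailable, which this sketch omits:

```python
# Hypothetical sketch of classification by elimination (Section 4.1).
# Each rule is (priority, precedent, antecedent, terminal).

UNIVERSE = {"NUMBER", "DATE", "ORDINAL", "DICHOTOMOUS", "CATEGORICAL"}

def classify(props, rules):
    candidates = set(UNIVERSE)                    # P(A) starts as U
    for _, precedent, antecedent, terminal in sorted(rules, key=lambda r: r[0]):
        if precedent(props):
            candidates -= antecedent              # eliminate antecedent classes
            if terminal and len(candidates) == 1:
                return next(iter(candidates))     # conclusively classified
    return "UNKNOWN"                              # ambiguous after all rules

rules = [
    (1, lambda p: p["data_type"] == "Date",
     {"NUMBER", "ORDINAL", "DICHOTOMOUS", "CATEGORICAL"}, True),
]
print(classify({"data_type": "Date"}, rules))     # DATE
print(classify({"data_type": "Text"}, rules))     # UNKNOWN
```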
Data Set | % Attr ✓ | Incorrect Classification
[2]      |     90.9 | Education
[3]      |    100.0 | No error
[8]      |     93.3 | Education
[11]     |     87.5 | Sample Date
[12]     |     77.7 | No of Cylinders, Time to accelerate
[15]     |    100.0 | No error
[17]     |    100.0 | No error
[19]     |    100.0 | No error
[26]     |     92.6 | # in family, # kids

Table 1. Application of heuristic rules to data sets

5 Statistics coupler

Most of the high-end statistical packages (e.g. SPSS, SAS, Minitab) allow users to submit batch requests following the job submission process illustrated in Figure 4. In this process, the user submits a sequence of analysis commands and a data file to be analyzed. The package generates an output file with the results specified by the statistical commands.

Figure 4. Batch job processing in statistical packages

In order to interface with the statistical package, a data mining system must submit a request in the following manner.

• Generate a command file. The command file contains instructions which the statistics package can process.

• Prepare a data file. The data file must be in a format which can be read by the statistical package. This is no longer a difficult problem, as many interchangeable file formats exist.

• Invoke the statistics package. The statistics package must be spawned by the data mining system. Most data mining systems are able to do this.

• Scan and extract the statistics output. The statistical output is often semi-structured. Pertinent results must be extracted from the output.

The statistics coupler provides facilities to perform these tasks. The coupler consists of 1) a statistics library, 2) a settings dictionary, and 3) the coupling engine.

5.1 Statistics library

Five kinds of information are stored in the statistics library: 1) statistical identification, 2) statistical command, 3) statistical input, 4) statistical format, and 5) statistical output information. We explain each kind of information below.

Statistical identification: Statistical identification (e.g. "Linear Regression", "Chi-Square", "Random Sampling") uniquely identifies each statistical function.

Statistical command: This is the information necessary for generating the command file. For example, the command to invoke an ANOVA in SPSS would be coded in a command variable as OneWay %1 BY %2, where %1 and %2 are markers for the placement of attribute groups.

Statistical input: Statistical input describes what the applicable attribute groups are for each function. It provides templates describing which domain classes can be applied to the various association measurement functions.

Statistical format: Statistical format allows the analyst to specify whether attribute groups need to be recoded to suit the association measurement function. For example, Goodman and Kruskal's λ is typically used for measuring the strength of the association of individual attributes on the nominal measurement scale. Multi-attribute attribute groups with nominal measurement scales can also have their associations measured, so long as the attributes of those attribute groups are merged. Thus, the association between {Occupation, Sex} and {Citizenship} can be measured if {Occupation, Sex} is recoded as one attribute.

Statistical output: Statistical output information specifies how the output generated by the statistical package should be read and interpreted. Output generated by statistical packages is often in a semi-structured format. To extract information automatically from statistical output, we require two kinds of information to be embedded in the statistics coupler:

• Result layout: The layout of relevant results in all possible permutations of the output must be known.

• Resolution of conflicts: We must be able to determine which result should be extracted. Conflict resolution is more than determining where relevant results
are located. It also implies deriving relevant information when it is not present in the output. For example, R² can be derived from output produced by a logit function by extracting the likelihood ratio test statistics of the best-fit model and the independent best-guess model. Conflict resolution also implies intelligent decision making. For example, a conflict resolution function can be specified so that the data mining system extracts the R² value from the output if the sample size of an analysis is below a particular value. Otherwise, the adjusted R² value is extracted.

5.2 Settings dictionary

Facilities to provide flexible integration with statistical packages are needed. For example, the data analyst must be able to substitute or integrate additional statistics packages with little difficulty. The settings dictionary contains information to facilitate such integration, including the name of the statistics package to be invoked, the names and locations of relevant files, the format of data files for export to the statistics package, and synchronization information. The statistics coupler modifies its procedure based on the settings information.

5.3 Coupling engine

Given a pair of attribute groups to be associated, and an association measurement function, the statistics coupler executes the following steps:

1. Examine the statistics input information to determine whether the pair of attribute groups can be analyzed by the function. If not, the coupler returns an error message.

2. Examine the statistics format information to determine whether any data conversion must be performed on the data set to be mined. If so, the data is converted.

3. Look up the settings information to communicate with the statistics package, and the statistics command information for the relevant function(s). The statistics coupler then generates the necessary commands in the language of the statistical package, and invokes the package.

4. Extract the relevant information from the statistics output, and return the result to the data mining system.

For example, suppose an analyst wants to extract the η² measure from an ANOVA on the attribute groups {Occupation, Sex} and {Salary}. The coupler reviews the attribute groups and determines that an ANOVA is appropriate for analyzing these groups. However, the coupler determines that the categorical attribute group (i.e. {Occupation, Sex}) must be recoded as a single attribute. The coupler performs the conversion and exports the data to the statistical package. The coupler then determines the appropriate analysis command, and invokes the function on the package. When the output is generated, the coupler reads the between-group and total sums of squares (SS) from the output, calculates eta-squared using the formula η² = Between SS / Total SS, and returns η² to the data mining system.

6 Validation of mining results

Validation of discovered associations is especially important, since data mining involves the analysis of many attribute groups. As more attributes are analyzed, the number of artifact relationships discovered increases [14]. For example, if 20 associations are discovered with statistical significance set at the 0.05 level, the probability of discovering an artifact association is 64% (i.e. 1 − (1 − 0.05)²⁰). When measuring the validity of an individual result, only individual-pair significance levels must be considered. However, when several results must be evaluated simultaneously (as is common in data mining), the all-pairs significance level must be taken into account [33, 36, 37]. Traditionally, the all-pairs significance threshold is controlled by reducing individual-pair significance thresholds. As a result, many interesting associations are rejected.

6.1 Validation method

We have developed a novel validation method with the following procedure:

1. Set threshold values: The analyst specifies the all-pairs significance threshold value for validation. An intermediate significance threshold value is then calculated from the all-pairs significance threshold value. The calculation is performed using an alpha-threshold based validation method (e.g. Bonferroni, Dunn).

2. Generate multiple test sets: Multiple test sets are then generated by extracting random tuples from the original relation.

3. Reapply measurement functions: The correlation and significance scores of the test sets are recalculated and retained.

4. Compare obtained p-values: We adapt Fisher's combined test [38] to validate the results of the test sets. The equation for Fisher's combined test is:

χ²_{df=2k} = −2 Σ_{i=1}^{k} ln p_i
Proceedings of the 33rd Hawaii International Conference on System Sciences - 2000
where k is the number of resamples, pi is the pvalue obtained for the association from each test set, and χ2df =2k is the chi-square statistic with 2k degrees of freedom. The p-value of χ2df =2k is then compared against the intermediate significance threshold value. If the intermediate significance threshold value is greater, the association is considered valid.
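The combined test can be sketched in a few lines without a statistics package: for even degrees of freedom df = 2k, the chi-square upper-tail probability has the closed form exp(−x/2) Σ_{j=0}^{k−1} (x/2)^j / j!. The sketch below reuses the five test-set p-values and the Dunn-adjusted threshold from the worked example in this section; the function name is ours, not part of LCDS.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Fisher's combined test: X^2 = -2 * sum(ln p_i), df = 2k.

    For even df = 2k the chi-square survival function is
    P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!,
    so the combined p-value needs only standard-library math.
    """
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    half = x / 2.0
    term, total = 1.0, 1.0          # j = 0 term
    for j in range(1, k):
        term *= half / j            # (x/2)^j / j!, built incrementally
        total += term
    return math.exp(-half) * total

# Five test-set p-values for one attribute-group pair (from the example):
p_combined = fisher_combined_pvalue([0.05, 0.06, 0.04, 0.05, 0.04])

# Dunn-adjusted intermediate threshold for 20 discovered associations:
threshold = 0.05 / 20               # 0.0025

print(round(p_combined, 4))         # 0.0007
print(p_combined < threshold)       # True: the association is considered valid
```

Note that although each individual p-value is near 0.05, the combined evidence across the five test sets is far stronger, which is what gives the method its power.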
For example, suppose 20 associations discovered in a sample from a data set were found to be significant at the 0.05 level. We want to ensure that the significance level for the combined set of 20 associations is at most 0.05. According to the Dunn method, the accepted significance threshold for one pair is 0.05 ÷ 20 = 0.0025. To validate the associations, five test sets are generated from the original data set, and the associations are recalculated for each test set. For one pair of attributes, the p-values of the test sets were calculated as 0.05, 0.06, 0.04, 0.05 and 0.04. Fisher's combined test shows that the combined p-value for these results is 0.0007. Since 0.0007 < 0.0025, the association for this pair is considered valid.

Our validation method is based on the common practice for validating experimental results in academic research [39]: experiments which discover interesting results are repeated on additional samples drawn from the original population, and results that are replicated in the additional samples are considered valid.

6.2 Advantages and constraints

Our validation method has greater statistical power than traditional validation methods, since individual pair α thresholds do not have to be set to unreasonably low values. Furthermore, when applied to large data sets, our method requires less computation time than traditional ones: the analysis of large data sets often requires frequent disk accesses, and since small partitions of a data set can easily fit in main memory, disk access time is greatly reduced. Our method also requires less computation time than most randomization tests. For example, over 1,000 resamples are required to calculate confidence intervals for the bootstrap [13]; in our earlier example, only five samples were used.

However, our method has two primary constraints. First, it cannot be applied to small data sets: as the sample size of each test set decreases, so does the statistical power [10] of our method. Second, the all-pairs significance of our validation method can never equal that of traditional ones. In traditional methods, each data point is considered with respect to every other data point in the data set; in our method, each data point is considered only with respect to the other data points in its test set. Thus, outliers have a greater effect on results. However, as the sample size of each test set increases, the effect of outliers is reduced. As a result, we recommend that the sample size of each test set be the largest that can fit into main memory.

7 Conclusion

Data analysts often do not have integrated tools to perform data mining. Moreover, many individuals who perform data analysis are unfamiliar with analysis methods or tools. In this paper, we have described the development of an integrated data mining system that can select appropriate measures of association for a data set, execute the association measurement functions, extract results from the statistical output, and validate them. A prototype system called the Linear Correlation Discovery System (LCDS), incorporating the selection assistant and the statistics coupler, has been implemented in Visual Basic using the Microsoft Access Database Engine and the SPSS Production Facility. We are currently extending LCDS to integrate with other data mining packages (e.g. association rule discovery packages). As with measures of association, appropriate association rule algorithms must be selected based on the characteristics of the attributes being examined. For example, quantitative [34], boolean [1], and interval [29] association rules are applicable only when attributes are ORDINAL, CATEGORICAL, and INTERVAL respectively. We are also investigating whether the associations between attribute groups are good predictors of the quantity and quality of knowledge discovered by other data mining algorithms. For example, a strong association between two attribute groups should imply that many non-spurious association rules can be discovered between the instances of those groups.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD Conference on Management of Data, pages 207–216, 1993.

[2] E. R. Berndt. Determinants of wages from the 1985 current population survey. In The Practice of Econometrics: Classic and Contemporary, chapter 5, pages 193–209. Addison-Wesley, 1991. [online] http://www.stat.cmu.edu/datasets/.

[3] T. J. Biblarz and A. E. Raftery. The effects of family disruption on social mobility. American Sociological Review, 58:97–109, February 1993.

[4] C. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases. [online] http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.

[5] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 265–276, 1997.
[6] R. Christensen. Log-Linear Models. Springer-Verlag, 1990.

[7] C. Chua, R. H. L. Chiang, and E.-P. Lim. A heuristic method for correlating attribute group pairs in data mining. In International Workshop on Data Warehousing and Data Mining (DWDM'98), pages 29–40, 1998.

[8] March 1995 population survey - classical families. [online] http://www.stat.ucla.edu/data/fpp, 1995.

[9] StatLib. [online] http://www.stat.cmu.edu/, 1998.

[10] J. Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, second edition, 1988.

[11] L. H. Cox, M. Johnson, and K. Kafadar. Exposition of statistical graphics technology. In ASA Proceedings of Statistical Computation Section, pages 55–56, 1982.

[12] D. Donoho and E. Ramos. PRIMDATA: Data sets for use with PRIM-H. [online] http://www.stat.cmu.edu/datasets/, 1982.

[13] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.

[14] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, pages 37–54, Fall 1996.

[15] S. E. Fienberg, U. E. Makov, and A. P. Sanil. A Bayesian approach to data disclosure: Optimal intruder behavior for continuous data. Technical Report 11/94, Carnegie-Mellon University, 1994.

[16] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1:11–28, 1997.

[17] V. Greaney and T. Kelleghan. Equality of Opportunity in Irish Schools. Educational Company, Dublin, 1984.

[18] D. J. Hand. Intelligent data analysis: Issues and opportunities. In Proceedings of the Second International Symposium, IDA-97, pages 1–14, 1997.

[19] A. Heston and R. Summers. The Penn World Table (Mark 5): An expanded set of international comparisons, 1950–1988. Quarterly Journal of Economics, 106(2):327–368, May 1991.

[20] W. Hou. Extraction and application of statistical relationships in relational databases. IEEE Transactions on Knowledge and Data Engineering, 8(6):939–945, December 1996.

[21] D. V. Huntsberger and P. Billingsley. Elements of Statistical Inference. Allyn and Bacon, 1987.

[22] J. Jaccard and M. A. Becker. Statistics for the Behavioral Sciences. Wadsworth Inc., second edition, 1990.

[23] J. F. Hair Jr., R. E. Anderson, R. L. Tatham, and W. Black. Multivariate Data Analysis with Readings. Prentice-Hall, fifth edition, 1998.

[24] J. Kivinen and H. Mannila. The power of sampling in knowledge discovery. In Proceedings of the 1994 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Theory (PODS '94), pages 77–85, 1994.

[25] J. Kivinen and H. Mannila. Approximate dependency inference from relations. Theoretical Computer Science, 149(1):129–149, September 1995.

[26] R. Kohavi. Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 202–207, 1996.

[27] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 1999. Forthcoming.

[28] J. S. Long. Regression Models for Categorical and Limited Dependent Variables. SAGE Publications, 1997.

[29] R. J. Miller and Y. Yang. Association rules over interval data. In Proceedings of the 1997 ACM SIGMOD Conference on Management of Data, pages 452–461, 1997.

[30] F. Mosteller and J. W. Tukey. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley, 1977.

[31] W. S. Sarle. Neural network FAQ, 1999. [online] ftp://ftp.sas.com/pub/neural/FAQ.html.

[32] P. D. Scott, A. P. M. Coxon, M. H. Hobbs, and R. J. Williams. SNOUT: An intelligent assistant for exploratory data analysis. In Principles of Data Mining and Knowledge Discovery, First European Symposium, PKDD '97, pages 189–199, 1997.

[33] J. P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46:561–584, 1995.

[34] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In ACM SIGMOD Conference on Management of Data, pages 1–12, 1996.

[35] H. Toivonen. Sampling large databases for association rules. In 22nd International Conference on Very Large Data Bases (VLDB '96), pages 134–145, 1996.

[36] L. E. Toothaker. Multiple Comparisons for Researchers. SAGE Publications, 1991.

[37] L. E. Toothaker. Multiple Comparison Procedures. SAGE Publications, 1993.

[38] F. M. Wolf. Meta-Analysis: Quantitative Methods for Research Synthesis. Quantitative Applications in the Social Sciences. SAGE Publications, 1986.

[39] J. Ziman. An Introduction to Science Studies: The Philosophical and Social Aspects of Science and Technology. Cambridge University Press, 1990.

0-7695-0493-0/00 $10.00 (c) 2000 IEEE