FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases

Jun Luo, Dept. of Computer Science, Ohio Northern University, Ada, OH 45810. Email: [email protected]

Sanguthevar Rajasekaran, Dept. of Computer Science & Engineering, University of Connecticut, 191 Auditorium Road, U-155, Storrs, CT 06269-3155. Email: [email protected]

Abstract

Association rule mining is an important data mining problem that has been studied extensively. In this paper, a simple but Fast algorithm for Intersecting attribute lists using a hash Table (FIT) is presented. FIT is designed to compute all the frequent itemsets in large databases efficiently. It deploys a similar idea to Eclat but achieves much better computational performance than Eclat for two reasons: 1) FIT performs fewer comparisons in each intersection operation between two attribute lists, and 2) FIT significantly reduces the total number of intersection operations. The experimental results demonstrate that the performance of FIT is much better than that of the Eclat and Apriori algorithms.

Keywords: association rule, frequent itemset, FIT.
1. Introduction

Association rule mining originated from the need to analyze large amounts of supermarket basket data [2][3][5][6][9][10][11][12]. It is a well-studied problem in data mining. The problem of mining association rules can be formally stated as follows: Let I = {i1, i2, ..., in} be a set of attributes, called items. An itemset is a subset of I. D represents a database that consists of a set of transactions. Each transaction in D contains two parts: a unique transaction identification number (tid) and an itemset. The size of an itemset is the number of items in it; an itemset of size d is called a d-itemset. A transaction t is said to contain an item i if i appears in t, and t is said to contain an itemset X if all the items in X are contained in t. The support of X is s if there are s transactions in D that contain X. X is said to be a frequent itemset if its support is greater than or equal to the user-specified minimum support (minsup). An association rule is an expression X => Y, in which X and Y are itemsets and X ∩ Y = ∅. The rule X => Y has support s if s transactions in D contain the itemset X ∪ Y, and it has confidence c if c percent of the transactions that contain X also contain Y. The symbol minconf denotes the user-specified minimum confidence. Given D, minsup, and minconf, the problem of mining association rules is to generate all the association rules whose supports and confidences are greater than or equal to minsup and minconf, respectively.

For convenience of discussion, some conventions are adopted in this paper. If X and Y are frequent itemsets and the union of X and Y is also a frequent itemset, then X and Y are said to have a strong association relation; otherwise, X and Y are said to have a weak association relation. If a symbol A represents a set or a list, then the notation |A| stands for the number of elements in A. Some other notations used in the rest of this paper are shown in Table 1.

Generally speaking, the task of mining association rules consists of two steps: 1) calculate all the frequent itemsets; 2) calculate the association rules from the frequent itemsets discovered in step 1). Between the two steps, the calculation of frequent itemsets is the more computationally expensive.
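The definitions of support and confidence above can be made concrete with a small sketch. The toy database and item names below are illustrative only, not taken from the paper:

```python
# Toy transaction database D: tid -> itemset (illustrative, not from the paper).
D = {
    1: {"bread", "milk"},
    2: {"bread", "butter", "milk"},
    3: {"butter", "milk"},
    4: {"bread", "butter"},
}

def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for items in db.values() if itemset <= items)

minsup = 2
X, Y = {"bread"}, {"milk"}
s = support(X | Y, D)        # support of the rule X => Y (transactions with X union Y)
c = s / support(X, D)        # confidence of X => Y
print(s, c)                  # X union Y appears in 2 of the 4 transactions
```

With minsup = 2, {bread, milk} is a frequent 2-itemset here since its support (2) meets the threshold; the rule bread => milk then has confidence 2/3.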
Table 1. Notations

Notation   Remark
Lk         The collection of frequent k-itemsets with their attribute lists.
ln         An attribute list in Lk, where 1 ≤ n ≤ |Lk|.
lni        The i-th attribute of ln, where 1 ≤ n ≤ |Lk| and 1 ≤ i ≤ |ln|.
CFLik      All the attribute lists that follow li in Lk, where 1 ≤ i < |Lk|.

In a breadth-first calculation, all the frequent k-itemsets, k ≥ 1, are calculated before any of the frequent (k+1)-itemsets is calculated. The idea of the depth-first calculation is that, given an Lk-1, k > 1, if intersection results between an attribute list (lp, 1 ≤ p
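The attribute lists referred to here are tid lists in the Eclat style: the list for an itemset records the tids of the transactions containing it, so the list for X ∪ Y is the intersection of the lists for X and Y, and the support is simply the length of that list. A minimal sketch of this intersection operation (the one FIT aims to speed up; the identifiers and data are my own, not the paper's):

```python
def intersect(tids_x, tids_y):
    """Intersect two sorted tid lists with a linear merge scan."""
    out, i, j = [], 0, 0
    while i < len(tids_x) and j < len(tids_y):
        if tids_x[i] == tids_y[j]:
            out.append(tids_x[i])
            i += 1
            j += 1
        elif tids_x[i] < tids_y[j]:
            i += 1
        else:
            j += 1
    return out

# Illustrative tid lists for two single items a and b.
tidlist = {"a": [1, 2, 4, 5], "b": [2, 3, 4]}
ab = intersect(tidlist["a"], tidlist["b"])  # tids containing both a and b
print(ab, len(ab))                          # len(ab) is the support of {a, b}
```

Each such intersection costs comparisons proportional to the lengths of both lists, which is why reducing both the per-intersection comparisons and the total number of intersections, as FIT does, matters.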