The Apriori Algorithm

Introduction: Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties, as we shall see following. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k 1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support.

The resulting set is denoted L1.Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property, presented below, is used to reduce the search space.

Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.

Input:

D, a database of transactions;

min sup, the minimum support count threshold.

Output: L, frequent itemsets in D.

Method:

1. L1 = find frequent 1-itemsets(D);

2. for (k = 2;L_k_-16= f;k ) f

3. Ck = apriori gen(L_k_-1);

4. for each transaction t 2 D f // scan D for counts

5. Ct = subset(Ck, t); // get the subsets of t that are candidates

6. for each candidate c 2 Ct

7. c.count ;

8. }

9. Lk = {c ∈ Ck|c.count ≥ min sup}

10. }

11. return L = U_kL_k;

procedure apriori gen(L_k_-1:frequent (k-1)-itemsets)

1. for each itemset l1 ∈ L_k_-1

2. for each itemset l2 ∈ L_k_-1

3. if (l1[1] = l2[1])^(l1[2] = l2[2])^:::^(l1[k-2] = l2[k-2])^(l1[k-1] < l2[k-1]) then f

4. c = l1 on l2; // join step: generate candidates

5. if has infrequent subset(c, L_k_-1) then

6. delete c; // prune step: remove unfruitful candidate

7. else add c to Ck;

8. }

9. return Ck;

procedure has infrequent subset(c: candidate k-itemset;

Lk-1: frequent (k-1)-itemsets); // use prior knowledge

1. for each (k-1)-subset s of c

2. if s ∉ Lk-1 then

3. return TRUE;

4. return FALSE;

SKEDSOFT