Rockstar Engineering

The Apriori Algorithm

Posted in Computer Science, Data Mining by Jose Asuncion on December 22, 2011

Here are just notes from my data mining class which I began to consolidate here in my blog as a way to assimilate the lessons.

The Apriori algorithm is a basic method for finding frequent itemsets. The latter is used to generate association rules with high confidence and high interest.

Here is my summary of it along with a running example. The following set of baskets will be used:

D = [ ('milk', 'cheese', 'tuna'),
      ('cheese', 'mushroom'),
      ('cheese', 'eggs'),
      ('milk', 'cheese', 'mushroom'),
      ('milk', 'eggs'),
      ('cheese', 'eggs'),
      ('milk', 'cheese', 'mushroom', 'tuna'), 
      ('milk', 'cheese', 'eggs') ]

Some definitions:

I – is the universal set of items. In the example above, the universal set would be {milk, cheese, tuna, mushroom, eggs}.

C_{k} – is like a k-combination of I. Like because items in this set should have frequent (k-1, k-2,…1)-itemsets.

Now, the apriori algorithm.

1. Generate C_{k} by cross joining the itemsets of L_{k-1} among themselves.

Cross joining two sets

A cross join between two k-item sets is a union of those two sets which results in a k+1 itemset. However, the join only happens if and only if the first k-1 items of both sets are equal.

For example:

[1,2] |x| [1,3] = [1,2,3]
[1,3] |x| [1,4] = [1,3,4]
[2,3] |x| [1,5] = no join

If k=1, simply list all 1 itemsets.

one_itemset = ['milk', 'cheese', 'eggs', 'mushroom', 'tuna']

2. Generate L_{k}, the frequent itsemsets by counting the number of times each element in C_{k} occurs in D. If an element in C_{k} \geq S or the support threshold, it is qualified to be a member of L_{k}. For k=1 and our example above for L_{1} is

c1 = {'cheese': 7, 
      'tuna': 2, 
      'eggs': 6, 
      'mushroom': 2, 
      'milk': 6}

3. Repeat the process for k \leq |I| until no frequent itemsets are found.

About these ads

One Response

Subscribe to comments with RSS.

  1. [...] This is a self imposed machine problem I wrote over a frantic afternoon yesterday for my lesson on Frequent Itemsets and the Apriori Algorithm. [...]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: