The Apriori Algorithm
Here are just notes from my data mining class which I began to consolidate here in my blog as a way to assimilate the lessons.
The Apriori algorithm is a basic method for finding frequent itemsets. The latter is used to generate association rules with high confidence and high interest.
Here is my summary of it along with a running example. The following set of baskets will be used:
D = [ ('milk', 'cheese', 'tuna'),
('cheese', 'mushroom'),
('cheese', 'eggs'),
('milk', 'cheese', 'mushroom'),
('milk', 'eggs'),
('cheese', 'eggs'),
('milk', 'cheese', 'mushroom', 'tuna'),
('milk', 'cheese', 'eggs') ]
Some definitions:
– is the universal set of items. In the example above, the universal set would be {milk, cheese, tuna, mushroom, eggs}.
– is like a k-combination of
. Like because items in this set should have frequent (k-1, k-2,…1)-itemsets.
Now, the apriori algorithm.
1. Generate by cross joining the itemsets of
among themselves.
Cross joining two sets
A cross join between two k-item sets is a union of those two sets which results in a k+1 itemset. However, the join only happens if and only if the first k-1 items of both sets are equal.
For example:
no join
If , simply list all 1 itemsets.
one_itemset = ['milk', 'cheese', 'eggs', 'mushroom', 'tuna']
2. Generate , the frequent itsemsets by counting the number of times each element in
occurs in
. If an element in
or the support threshold, it is qualified to be a member of
. For
and our example above for
is
c1 = {'cheese': 7,
'tuna': 2,
'eggs': 6,
'mushroom': 2,
'milk': 6}
3. Repeat the process for until no frequent itemsets are found.
[...] This is a self imposed machine problem I wrote over a frantic afternoon yesterday for my lesson on Frequent Itemsets and the Apriori Algorithm. [...]