Getting Started With Data Mining
These are notes from my data mining class, which I am consolidating here on my blog as a way to assimilate the lessons.
1. The market basket model is probably the easiest introduction for anyone interested in data mining. The concept is simple. There are baskets and there are items in those baskets.
The market basket model: There are items and there are baskets, also called itemsets, that hold those items.
2. Closely related to the market basket model is the concept of frequent itemsets. Intuitively, a set of items is frequent if it occurs in many baskets.
3. The following terms are used a lot when talking about frequent itemsets:
- Support count – the number of baskets in a set of baskets that contain a given itemset. For example, in the basket set
[ ('cheese', 'milk', 'eggs'), ('milk'), ('milk', 'eggs', 'bread') ]
the support count of the itemset (‘milk’, ‘eggs’) is 2, since it is a subset of two of the three baskets.
- Support threshold – a numerical limit that draws the line between a frequent itemset and a non-frequent one. For example, in the basket set above, if one sets the support threshold at 2 (that is, support count >= 2), then one can say that the itemset (‘milk’, ‘eggs’) is frequent.
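To make the two terms concrete, here is a minimal Python sketch of the example above. The function names `support_count` and `is_frequent` are my own labels for illustration, not part of any library, and the baskets are stored as sets so the subset test is a one-liner.

```python
# Baskets from the example above, stored as sets for easy subset tests.
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]


def support_count(itemset, baskets):
    """Count the baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)


def is_frequent(itemset, baskets, threshold):
    """An itemset is frequent when its support count reaches the threshold."""
    return support_count(itemset, baskets) >= threshold


print(support_count({"milk", "eggs"}, baskets))   # 2
print(is_frequent({"milk", "eggs"}, baskets, 2))  # True
```

With the same threshold of 2, (‘bread’) would not be frequent, because it appears in only one basket.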
4. Frequent itemsets are presented as an if-then rule like so: $I \rightarrow j$, where I is a set of items and j is an item. This representation is called an association rule. In words, it can be said that if I appears in a basket then j is “likely” to appear as well.
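5. In data mining parlance, the concept of ‘likely’ is more formally known as the confidence of the rule $I \rightarrow j$. Mathematically,
$$\text{conf}(I \rightarrow j) = \frac{\text{support}(I \cup \{j\})}{\text{support}(I)}$$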
The insights behind this formula are that
- Baskets with $I \cup \{j\}$ in them (that is, both I and j) cannot outnumber baskets with I in them. Think about it. If a basket has I, then j may or may not be in the same basket.
- Having said that, the more of the baskets with I that also contain j, the better. This makes the confidence in the rule stronger.
- If the support of $I \cup \{j\}$ is the same as the support of I, then the confidence is 1 or 100%. In other words, j appears every time I does!
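The confidence calculation can be sketched in a few lines of Python, reusing the three baskets from the support count example. This is only an illustration under my own naming (`support_count`, `confidence`); I am assuming confidence is the support of I together with j divided by the support of I.

```python
# Same three baskets as in the support count example.
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]


def support_count(itemset, baskets):
    """Count the baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)


def confidence(I, j, baskets):
    """Confidence of the rule I -> j: support(I with j) / support(I)."""
    support_I = support_count(I, baskets)
    if support_I == 0:
        raise ValueError("I never appears, so the confidence is undefined")
    return support_count(set(I) | {j}, baskets) / support_I


# milk appears in 3 baskets, milk and eggs together in 2.
print(confidence({"milk"}, "eggs", baskets))  # 2/3, about 0.67
# eggs appears in 2 baskets, and milk is in both of them.
print(confidence({"eggs"}, "milk", baskets))  # 1.0
```

The second rule shows the 100% case: every basket with eggs also has milk, so the two support counts are equal.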
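6. In an association rule $I \rightarrow j$, interest is an indicator of how the itemset on the left affects the item on the right. The formula is:
$$\text{interest}(I \rightarrow j) = \text{conf}(I \rightarrow j) - \Pr[j]$$
where $\Pr[j]$ is the fraction of all baskets that contain j.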
The insight behind this formula is that
- If the confidence outweighs the fraction of baskets with j, then it can be said that there is indeed a correlation between I and j, and/or that the presence of I somehow affects the presence of j.
- On the other hand, if j also shows up in many baskets that do not contain I, so that the fraction of baskets with j is about as large as the confidence, then the association rule isn’t really strong. It is not the presence of I that implies the presence of j but something else. The instances when I and j are together in a basket can be said to be coincidental.
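Here is a small Python sketch of the interest calculation on the same three baskets. The helper names are mine, and I am assuming interest is the confidence of the rule minus the fraction of baskets that contain j.

```python
# Same three baskets as in the earlier examples.
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]


def support_count(itemset, baskets):
    """Count the baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)


def interest(I, j, baskets):
    """Interest of I -> j: confidence minus the fraction of baskets with j."""
    conf = support_count(set(I) | {j}, baskets) / support_count(I, baskets)
    fraction_with_j = support_count({j}, baskets) / len(baskets)
    return conf - fraction_with_j


# Confidence is 1.0, but milk is in every basket anyway, so eggs
# tell us nothing new about milk.
print(interest({"eggs"}, "milk", baskets))    # 0.0
# Confidence is 1.0 while eggs are in only 2 of 3 baskets.
print(interest({"cheese"}, "eggs", baskets))  # 1/3, about 0.33
```

The first rule has perfect confidence yet zero interest, which is exactly the "something else" case: milk is simply everywhere, with or without eggs.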