Rockstar Engineering

Getting Started With Data Mining

Posted in Data Mining by Jose Asuncion on December 21, 2011

These are notes from my data mining class, which I am consolidating here on my blog as a way to assimilate the lessons.

1. The market basket model is probably the easiest introduction for anyone interested in data mining. The concept is simple: there are baskets, and there are items in those baskets.

The market basket model: there are items, and there are baskets, also called itemsets, that hold those items.

2. Closely related to the market basket model is the concept of frequent itemsets. Intuitively, a set of items is frequent if it appears together in many baskets.

3. The following terms are used a lot when talking about frequent itemsets:

  • Support count – the number of baskets, out of a given set of baskets, that contain a particular itemset. For example, in the basket set

    [ ('cheese', 'milk', 'eggs'), ('milk'), ('milk', 'eggs', 'bread') ]

    the support count of the itemset (‘milk’, ‘eggs’) is 2, since it is a subset of two of the three baskets.

  • Support threshold – the minimum support count an itemset needs in order to be called frequent. For example, in the basket set above, if one sets the support threshold at 2, then the itemset (‘milk’, ‘eggs’) is frequent because its support count is >= 2.
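
As a quick check on the two definitions above, here is a minimal Python sketch. The function names support_count and is_frequent are my own illustrative choices, not something from the class, and the baskets are modelled as Python sets.

```python
# The three example baskets, each modelled as a set of items.
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]

def support_count(itemset, baskets):
    """Count the baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    # `itemset <= basket` is Python's subset test for sets.
    return sum(1 for basket in baskets if itemset <= basket)

def is_frequent(itemset, baskets, threshold):
    """An itemset is frequent if its support count meets the threshold."""
    return support_count(itemset, baskets) >= threshold

print(support_count({"milk", "eggs"}, baskets))             # 2
print(is_frequent({"milk", "eggs"}, baskets, threshold=2))  # True
```

Scanning every basket for every itemset is fine for a toy example like this one, though it would be slow on a large basket set.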

4. Frequent itemsets are often presented as if-then rules of the form I \to j, where I is a set of items and j is an item. This representation is called an association rule. In words, it can be said that if I appears in a basket then j is “likely” to appear as well.

5. In data mining parlance, the concept of ‘likely’ is more formally known as the confidence of the rule I \to j. Mathematically,

Confidence(I \to j) = \frac{Support(I \cup \{j\})}{Support(I)}

The insights behind this formula are that

  • The number of baskets containing I \cup \{j\} can never exceed the number of baskets containing I. Think about it: a basket that has I may or may not also have j in it. So the confidence is at most 1.
  • Having said that, the more of the baskets with I that also contain j, the better. This makes the confidence in the rule stronger.
  • If Support(I \cup \{j\}) is the same as Support(I), then the confidence is 1, or 100%. In other words, j shows up every time I does!
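
To see the confidence formula in action, here is a small Python sketch that reuses the three baskets from the support count example. The helper names support_count and confidence are illustrative assumptions of mine, and raising an error when I never appears is my own choice, since the formula is undefined in that case.

```python
# The three example baskets, each modelled as a set of items.
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]

def support_count(itemset, baskets):
    """Count the baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, j, baskets):
    """Confidence of the rule I -> j: Support(I u {j}) / Support(I)."""
    support_I = support_count(I, baskets)
    if support_I == 0:
        # Assumption: treat a rule whose left side never occurs as an error.
        raise ValueError("I never appears, so the confidence is undefined")
    return support_count(set(I) | {j}, baskets) / support_I

print(confidence({"eggs"}, "milk", baskets))  # 1.0
print(confidence({"milk"}, "eggs", baskets))  # 2 / 3, about 0.67
```

Every basket with eggs also has milk, so eggs \to milk has confidence 1. Going the other way, only two of the three baskets with milk have eggs, so milk \to eggs is weaker.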

6. In an association rule, interest is an indicator of how much the presence of the items on the left affects the presence of the item on the right in I \to j. The formula is:

Interest(I \to j) = Confidence(I \to j) - \frac{\# of baskets containing j}{\# of baskets}

The insight behind this formula is that

  • If the confidence outweighs the fraction of baskets with j, then the interest is positive, and it can be said that there is indeed a correlation between I and j and/or the presence of I somehow affects the presence of j.
  • On the other hand, if j appears in about as large a fraction of all baskets as it does in the baskets with I, then the interest is close to zero and the association rule I \to j isn’t really strong. It is not the presence of I that implies the presence of j; j is simply common everywhere. The instances when I and j are together in a basket can be said to be coincidental.
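
Here is a minimal Python sketch of the interest formula, again on the three example baskets from the support count example. The function names are illustrative assumptions of mine, not part of the class material.

```python
# The three example baskets, each modelled as a set of items.
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]

def support_count(itemset, baskets):
    """Count the baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, j, baskets):
    """Confidence of the rule I -> j (assumes I appears at least once)."""
    return support_count(set(I) | {j}, baskets) / support_count(I, baskets)

def interest(I, j, baskets):
    """Confidence of I -> j minus the fraction of all baskets containing j."""
    fraction_with_j = support_count({j}, baskets) / len(baskets)
    return confidence(I, j, baskets) - fraction_with_j

# Milk is in every basket, so eggs -> milk has confidence 1 but interest 0.
print(interest({"eggs"}, "milk", baskets))    # 0.0
# Eggs are in 2 of 3 baskets overall, but in every basket with cheese.
print(interest({"cheese"}, "eggs", baskets))  # about 0.33
```

The first rule shows why confidence alone can mislead: eggs \to milk holds every time, but only because milk is in every basket anyway.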
