Data Lab: Implementing Apriori Algorithm In Hadoop-HBase - Part 1 : Introduction to Apriori Algorithm

Tuesday, 9 December 2014

The Apriori algorithm is a frequent itemset mining algorithm for transactional databases, proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. It first identifies the frequent individual items in the database and then extends them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the database. The frequent itemsets found by Apriori can then be used to derive association rules, which highlight general trends in the database.

Before we look at how the algorithm works, it helps to be familiar with the terminology it uses. The examples below refer to the following transactions:

Tid | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Milk
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Itemset

A collection of one or more items
Example: {Milk, Bread, Diaper}

k-itemset

An itemset that contains k items

Support count ()

Frequency of occurrence of an itemset
E.g. ({Milk, Bread, Diaper}) = 2

Support

Fraction of transactions that contain an itemset
E.g. s( {Milk, Bread, Diaper} ) = 2/5
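The two definitions above translate almost directly into code. The following is a minimal Python sketch, not the Hadoop-HBase implementation this series builds towards; the names `TRANSACTIONS`, `support_count` and `support` are illustrative choices, and the data is copied from the transaction table at the top of the post.

```python
# Transactions copied from the table above (Tid -> set of items).
TRANSACTIONS = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Milk"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    # `items <= t` is the subset test: every item is present in transaction t.
    return sum(1 for t in transactions.values() if items <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


if __name__ == "__main__":
    example = {"Milk", "Bread", "Diaper"}
    print(support_count(example, TRANSACTIONS))  # 3 (transactions 2, 4 and 5)
    print(support(example, TRANSACTIONS))        # 0.6
```

Storing each transaction as a set keeps the containment check to a single subset comparison, which is the operation Apriori repeats for every candidate itemset.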

Frequent Itemset

An itemset whose support is greater than or equal to a minsup threshold.
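To make the idea of "extending frequent items to larger and larger itemsets" concrete, here is a small single-machine Python sketch of the level-wise search. It is only an illustration under assumptions: the function name `frequent_itemsets` and the threshold `minsup = 0.6` are chosen for this example and do not come from the post, and nothing here reflects how the work would be distributed on Hadoop or HBase.

```python
from itertools import combinations

# Transactions copied from the table above.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Milk"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def frequent_itemsets(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: every individual item is a candidate 1-itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while candidates:
        # Keep only the candidates that meet the minsup threshold.
        current = {}
        for c in candidates:
            s = sup(c)
            if s >= minsup:
                current[c] = s
        result.update(current)

        # Build (k+1)-item candidates by joining pairs of frequent k-itemsets.
        k += 1
        joined = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune: every (k-1)-item subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
    return result


if __name__ == "__main__":
    # minsup = 0.6 is an assumed threshold for this example only.
    found = frequent_itemsets(TRANSACTIONS, minsup=0.6)
    for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), s)
```

The pruning step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so candidates with an infrequent subset are discarded before their support is ever counted.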

Association Rule

An implication expression of the form X → Y, where X and Y are itemsets.
Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics

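Support (s) - Fraction of transactions that contain both X and Y, i.e. the number of transactions containing every item of X and Y divided by the total number of transactions
Confidence (c) - Measures how often items in Y appear in transactions that contain X, i.e. the number of transactions containing both X and Y divided by the number of transactions containing X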
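As a worked example of the two metrics, the Python sketch below evaluates the rule {Milk, Diaper} → {Beer} on the transaction table from the top of the post. The helper name `rule_metrics` is hypothetical and used only for illustration.

```python
# Transactions copied from the table above.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Milk"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def rule_metrics(x, y, transactions):
    """Return (support, confidence) of the association rule X -> Y."""
    x, y = set(x), set(y)
    # Transactions that contain every item of both X and Y.
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    # Transactions that contain every item of X.
    n_x = sum(1 for t in transactions if x <= t)
    s = n_xy / len(transactions)
    # Confidence is undefined when X never occurs; report 0.0 in that case.
    c = n_xy / n_x if n_x else 0.0
    return s, c


if __name__ == "__main__":
    s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, TRANSACTIONS)
    print(f"support={s:.2f} confidence={c:.2f}")  # support=0.60 confidence=0.75
```

Here {Milk, Diaper, Beer} appears in transactions 2, 3 and 4, while {Milk, Diaper} appears in transactions 2, 3, 4 and 5, which gives a support of 3/5 and a confidence of 3/4.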