Tuesday, 9 December 2014

Implementing Apriori Algorithm In Hadoop-HBase - Part 1 : Introduction to Apriori Algorithm

Apriori is a frequent itemset mining algorithm for transactional databases, proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the database. The frequent itemsets found by Apriori can then be used to derive association rules, which highlight general trends in the database.
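To make the level-wise idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration, not the HBase/MapReduce implementation this series is about: the function name `apriori` and the parameter `min_support_count` are my own choices, and "sufficiently often" is assumed to mean "in at least `min_support_count` transactions". The sample data is the five-transaction table used for the terminology later in this post.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {itemset (frozenset): support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data: count the transactions containing each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent individual items.
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_support_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items()
                   if n >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent


transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Milk"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

frequent = apriori(transactions, min_support_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so candidates with an infrequent subset are dropped before they are ever counted against the data.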

Before we go further and see how the algorithm works, it helps to be familiar with the terminology it uses. The examples below refer to the following set of transactions:

Tid | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Milk
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

  • Itemset
    A collection of one or more items.
    Example: {Milk, Bread, Diaper}
    k-itemset: an itemset that contains k items.
  • Support count (σ)
    Frequency of occurrence of an itemset.
    E.g. σ({Milk, Bread, Diaper}) = 3 (transactions 2, 4 and 5)
  • Support (s)
    Fraction of transactions that contain an itemset.
    E.g. s({Milk, Bread, Diaper}) = 3/5
  • Frequent Itemset
    An itemset whose support is greater than or equal to a minsup threshold.
  • Association Rule
    An implication expression of the form X → Y, where X and Y are itemsets.
    Example: {Milk, Diaper} → {Beer}
  • Rule Evaluation Metrics
    Support (s) - Fraction of transactions that contain both X and Y.
    Confidence (c) - Measures how often items in Y appear in transactions that contain X.
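
The definitions above translate almost directly into code. Below is a small Python sketch that computes them on the example transactions. The helper names `support_count`, `support` and `confidence` are my own, and confidence is computed with the usual formula c(X → Y) = σ(X ∪ Y) / σ(X), which is assumed here rather than spelled out in the list above.

```python
# The five example transactions from the table above.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Milk"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """How often the items of Y appear in transactions that contain X.

    Assumes X occurs in at least one transaction; otherwise this divides by zero.
    """
    both = set(x) | set(y)
    return support_count(both, transactions) / support_count(x, transactions)


# Support count of an itemset.
print(support_count({"Milk", "Bread", "Diaper"}, TRANSACTIONS))

# Rule {Milk, Diaper} -> {Beer}
print(support({"Milk", "Diaper", "Beer"}, TRANSACTIONS))       # 0.6
print(confidence({"Milk", "Diaper"}, {"Beer"}, TRANSACTIONS))  # 0.75
```

For the rule {Milk, Diaper} → {Beer}, three of the five transactions contain all three items and four contain {Milk, Diaper}, which gives a support of 3/5 and a confidence of 3/4.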


In the next few posts, I will describe how to implement this algorithm with HBase and MapReduce.

2 comments:

  1. sir, please help me... how to execute apriori jar in hadoop....

    Replies
    1. Hi. I need help in this topic.
      @saran - did u understand it?
