Mining Frequent Itemsets – Apriori Algorithm

  • Post last modified: March 15, 2018
  • Post category: Data Mining


The Apriori algorithm is an algorithm for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the database. The frequent itemsets determined by Apriori can then be used to derive association rules, which highlight general trends in the database.


Association rule mining

Association rule mining is defined as:

Let I = {i1, i2, …, in} be a set of ‘n’ binary attributes called items. Let D = {t1, t2, …, tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items X and Y are called the antecedent and the consequent of the rule, respectively.

Useful Terms

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best known constraints are minimum thresholds on support and confidence.

Support

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set that contain the itemset.

supp(X) = (number of transactions that contain the itemset X) / (total number of transactions)

Confidence

The confidence of a rule is defined as:

conf(X → Y) = supp(X ∪ Y) / supp(X)
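As a concrete illustration, both measures can be computed directly from a list of transactions. The transactions below are hypothetical examples, not data from this article:

```python
def support(itemset, transactions):
    """supp(X): fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y) = supp(X ∪ Y) / supp(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# Hypothetical market-basket transactions (illustrative only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

print(support({"bread"}, transactions))               # 0.75
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3: bread+milk in 2 of the 3 bread baskets
```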

Definition of Apriori Algorithm

  • The Apriori Algorithm is an influential algorithm for mining frequent item sets for Boolean association rules.
  • Apriori uses a “bottom up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
  • Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits).

Key Concept

  • Frequent itemsets: all itemsets that meet the minimum support (the set of frequent k-itemsets is commonly denoted Lk).
  • Apriori property: any subset of a frequent itemset must itself be frequent.
  • Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
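A minimal sketch of the join and prune operations, assuming frequent (k−1)-itemsets are stored as sorted tuples (function names and data are illustrative):

```python
from itertools import combinations

def apriori_join(L_prev, k):
    """Join L(k-1) with itself: merge pairs that agree on their first k-2 items."""
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                candidates.add(a + (b[k - 2],))
    return candidates

def apriori_prune(candidates, L_prev, k):
    """Apriori property: drop candidates with an infrequent (k-1)-subset."""
    return {c for c in candidates
            if all(s in L_prev for s in combinations(c, k - 1))}

# Hypothetical frequent 2-itemsets
L2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
C3 = apriori_prune(apriori_join(L2, 3), L2, 3)
print(C3)  # {('A', 'B', 'C')} — ('B','C','D') is pruned since ('C','D') is not in L2
```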

Apriori Algorithm Steps

Below are the Apriori algorithm steps:

  1. Scan the transaction database to get the support ‘S’ of each 1-itemset, compare ‘S’ with min_sup, and get the set of frequent 1-itemsets, L1.
  2. Join Lk-1 with itself to generate a set of candidate k-itemsets. Use the Apriori property to prune the infrequent k-itemsets from this set.
  3. Scan the transaction database to get the support ‘S’ of each candidate k-itemset in the resulting set, compare ‘S’ with min_sup, and get the set of frequent k-itemsets, Lk.
  4. If the candidate set is empty, then for each frequent itemset l, generate all nonempty subsets of l.
  5. For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if the confidence of the rule “s ⇒ (l − s)” is at least min_conf.
  6. If the candidate set is not empty, go to step 2.
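The frequent-itemset steps above can be sketched in Python. This is a simplified illustration (itemsets are frozensets, and the small database is the classic textbook example, not data from this article):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support} for all frequent itemsets."""
    n = len(transactions)
    def sup(c):
        return sum(1 for t in transactions if c <= t) / n

    # Step 1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup}
    frequent = {c: sup(c) for c in Lk}
    k = 2
    while Lk:
        # Step 2: join Lk-1 with itself, then prune via the Apriori property
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Step 3: keep candidates whose support reaches min_sup
        Lk = {c for c in candidates if sup(c) >= min_sup}
        frequent.update({c: sup(c) for c in Lk})
        k += 1
    return frequent

# Classic illustrative database, min_sup = 0.5
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(D, 0.5)
print(freq[frozenset({"B", "C", "E"})])  # 0.5
```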

Example of the Apriori Algorithm

Market-basket analysis is one of the classic applications of Apriori.

  • Provides insight into which products tend to be purchased together and which are most amenable to promotion.
  • The resulting rules may be actionable, trivial, or inexplicable:
  • Trivial rule: people who buy chalk pieces also buy dusters.
  • Inexplicable rule: people who buy mobile phones also buy bags.

Database D

Minsup = 0.5

(Figure: example transaction data sets for the Apriori algorithm; original image not reproduced.)

Apriori Algorithm: Pseudo code

  • Join step: the candidate set Ck is generated by joining Lk-1 with itself.
  • Prune step: any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
  • Pseudo-code:

(Figure: Apriori algorithm pseudo-code; original image not reproduced.)
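The pseudo-code appears as an image in the original post; the standard Apriori loop, as given in most textbooks, is roughly:

```text
L1 = {frequent 1-itemsets}
for (k = 2; L(k-1) is not empty; k++):
    Ck = candidates generated by joining L(k-1) with itself,
         then pruning any candidate with an infrequent (k-1)-subset
    for each transaction t in D:
        increment the count of every candidate in Ck contained in t
    Lk = { c in Ck : count(c) / |D| >= min_sup }
return the union of all Lk
```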

Limitations

  • The Apriori algorithm can be very slow, and the bottleneck is candidate generation.
  • For example, if the transaction database has 10^4 frequent 1-itemsets, it will generate about 10^7 candidate 2-itemsets even after employing downward closure.
  • To find the itemsets with support above min_sup, the database needs to be scanned at every level; this requires (n + 1) scans, where n is the length of the longest pattern.

Methods To Improve Apriori’s Efficiency

  • Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  • Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans
  • Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
  • Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness
  • Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
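For instance, transaction reduction can be sketched as follows (illustrative code, not from the article): a transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset, so it can be dropped before the next scan.

```python
def reduce_transactions(transactions, Lk):
    """Keep only transactions containing at least one frequent k-itemset,
    since the rest cannot contribute to any frequent (k+1)-itemset."""
    return [t for t in transactions if any(c <= t for c in Lk)]

# Hypothetical database and frequent 1-itemsets
D = [{"A", "C"}, {"B", "E"}, {"A", "B", "C"}, {"D"}]
L1 = [frozenset({"A"}), frozenset({"B"}), frozenset({"C"})]
print(reduce_transactions(D, L1))  # the transaction {"D"} is dropped
```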

Apriori Advantages/Disadvantages

  • Advantages:
      • Uses the large itemset property
      • Easily parallelized
      • Easy to implement
  • Disadvantages:
      • Assumes the transaction database is memory resident
      • Requires many database scans