Vipul Chaskar's Blog: Implementation of Apriori Algorithm

Apriori algorithm is used for finding frequently occurring items and associative rule mining from from an input database which is transactional. Associative rule mining and Apriori algorithm are part of a bigger domain of data mining. Data mining is basically the process of discovering patterns in large data sets. There can be many applications of apriori algorithm e.g. market basket analysis. Here is a link I found particularly useful for learning Apriori Algorithm and associative rule mining:

http://www.cs.uic.edu/~liub/teach/cs583-spring-12/cs583.html

I have implemented the first two passes of Apriori Algorithm as a part of an academic assignment. It was for constructing attack graphs which are used for threat prediction and vulnerability assessment. The algorithm is implemented in python and its very simple. It can be extended for k passes of the algorithm.

Here is the code:

minsup = 0.3
minconf = 0.8

def count_first(transactions):
adict = {}
for t in transactions:
for item in t:
if item in adict:
adict[item] += 1
else:
adict[item] = 1
return adict

def find_frequent(Candidate, minsup, no_of_lines):
adict={}
for key in Candidate:
if ((float)(Candidate[key]/no_of_lines)) >= minsup:
adict[key] = Candidate[key]
return adict

def candidate_gen(keys):
adict={}
for i in keys:
for j in keys:
if i != j and (j,i) not in adict:
adict[tuple([min(i,j),max(i,j)])] = 0
return adict

def add_frequency(Candidate, transactions):
for key in Candidate:
for t in transactions:
if key[0] in t and key[1] in t:
Candidate[key] += 1
return Candidate

f = open("testcase.txt","r")
transactions = []
no_of_lines=0

for line in f:
split_line = line.split()
transactions.append(split_line)
no_of_lines = no_of_lines + 1

print(no_of_lines)

#First iteration
C1 = count_first(transactions)
F1 = find_frequent(C1,minsup,no_of_lines)
#Second iteration
C2 = candidate_gen(F1.keys())
C2 = add_frequency(C2,transactions)
F2 = find_frequent(C2,minsup,no_of_lines)
print(F2)

It accepts input data from the file "testcase.txt". In this file, each new line represents one transaction. Each such transaction contains items represented by 'integers' and separated by spaces. So, essentially, each line is a collection of integers separated by spaces.

The program makes two passes of apriori over input data. It outputs pairs of frequent item sets along with their 'support' metric values.

Comments and suggestions are welcome :)

Vipul Chaskar's Blog

Monday, February 10, 2014

Implementation of Apriori Algorithm

2 comments: