We have a transaction dataset, \(D\).
This includes transactions of items in \(I\).
Any subset of \(I\) is an itemset.
A subset of size \(k\) is a \(k\)-itemset.
Transactions are a \(k\)-itemset with a unique id, tid.
The set of all transactions is \(T\).
A tidset is a subset of \(T\).
We have a total order on the items.
An itemset \(ab\) is greater than \(a\), for examplle.
The two points of the lattice are the nullset, and \(I\).
We have a mapping from \(I\) to \(T\) called \(t\).
We have another mapping from \(T\) to \(I\) called \(i\).
We define the frequency of an itemset as the number of transactions it appears in.
We can write the frequency of \(A\) as \(\sum A\).
We use frequent patterns to generate strong rules, \(R\).
An example of a strong rule is \(a\rightarrow b\).
We can look at this by comparing the support of \(a\) to \(a\land b\).
\(supp(A\rightarrow B)=P(A\land B)\)
\(conf(A\rightarrow B)=P(B|A)\)
\(conf(A\rightarrow B)=\dfrac{P(A\land B)}{P(A)}\)
We define the support of an itemset as the proportion of transactions which contain the itemset.
\(supp(A)=\dfrac{\sum A }{n}\).
We can also consider this as:
\(supp(A)=P(A)\)
A frequent itemset is one where the support is above a minimum.
We know that if an itemset is frequent, then all its subsets are also frequent.
We look for frequent patterns, \(F\), between the items \(I\).
An example of a frequent pattern is \(\{a,b\}\).
The confidence of a frequent pattern is defined as:
\(conf (A\rightarrow B)=\dfrac{supp(A\land B)}{supp(A)}\)
We can use search algorithms to find frequent patterns. Starting at the empty set.
Breadth first search to generate candidate set of itemsets with support above some value.
Start with a \(1\)-itemset, and increase \(k\) once done.
Once we have found a frequent pattern, we can immediately identify other frequent patterns associated with it.
We can do this by looking at confidence, not support.
An alternative measure for finding rules is to use interest.
\(Interest(A\rightarrow B)=\dfrac{P(A\land B)}{P(A)P(B)}\)
\(Interest(A\rightarrow B)=\dfrac{supp(A\land B)}{P(supp(A))P(supp(B))}\)
If this is \(1\), then they are independent.
If this is greater than \(1\), they are positively dependent.
If this is less than \(1\), they are negatively dependent.
The search space is infinite in size. For example continuous age.
We choose intervals instead.