Association rules

The data

We have a transaction dataset, \(D\).

This includes transactions of items in \(I\).

Any subset of \(I\) is an itemset.

A subset of size \(k\) is a \(k\)-itemset.

Transactions are a \(k\)-itemset with a unique id, tid.

The set of all transactions is \(T\).

A tidset is a subset of \(T\).

Forming a lattice

We have a total order on the items.

An itemset \(ab\) is greater than \(a\), for examplle.

The two points of the lattice are the nullset, and \(I\).

Mappings

We have a mapping from \(I\) to \(T\) called \(t\).

We have another mapping from \(T\) to \(I\) called \(i\).

Frequency

We define the frequency of an itemset as the number of transactions it appears in.

We can write the frequency of \(A\) as \(\sum A\).

Strong rules

We use frequent patterns to generate strong rules, \(R\).

An example of a strong rule is \(a\rightarrow b\).

We can look at this by comparing the support of \(a\) to \(a\land b\).

\(supp(A\rightarrow B)=P(A\land B)\)

\(conf(A\rightarrow B)=P(B|A)\)

\(conf(A\rightarrow B)=\dfrac{P(A\land B)}{P(A)}\)

Support

We define the support of an itemset as the proportion of transactions which contain the itemset.

\(supp(A)=\dfrac{\sum A }{n}\).

We can also consider this as:

\(supp(A)=P(A)\)

Frequent patterns

A frequent itemset is one where the support is above a minimum.

We know that if an itemset is frequent, then all its subsets are also frequent.

We look for frequent patterns, \(F\), between the items \(I\).

An example of a frequent pattern is \(\{a,b\}\).

Confidence

The confidence of a frequent pattern is defined as:

\(conf (A\rightarrow B)=\dfrac{supp(A\land B)}{supp(A)}\)

Finding strong rules using the Apriori algorithm

Finding frequent patterns

We can use search algorithms to find frequent patterns. Starting at the empty set.

Apriori algorithm

Breadth first search to generate candidate set of itemsets with support above some value.

Start with a \(1\)-itemset, and increase \(k\) once done.

Once we have found a frequent pattern, we can immediately identify other frequent patterns associated with it.

We can do this by looking at confidence, not support.

Interest

An alternative measure for finding rules is to use interest.

\(Interest(A\rightarrow B)=\dfrac{P(A\land B)}{P(A)P(B)}\)

\(Interest(A\rightarrow B)=\dfrac{supp(A\land B)}{P(supp(A))P(supp(B))}\)

If this is \(1\), then they are independent.

If this is greater than \(1\), they are positively dependent.

If this is less than \(1\), they are negatively dependent.

Quantitative association rules

The search space is infinite in size. For example continuous age.

We choose intervals instead.