Consider Bayes’ theorem
\(P(y|x_1,x_2,...,x_n)=\dfrac{P(x_1,x_2,...,x_n|y)P(y)}{P(x_1,x_2,...,x_n)}\)
Here, \(y\) is the label and \(x_1,x_2,...,x_n\) are the pieces of evidence (the features). We want the probability of each label given the evidence.
The denominator, \(P(x_1,x_2,...,x_n)\), is the same for every label, so we only need the numerator:
\(P(y|x_1,x_2,...,x_n)\propto P(x_1,x_2,...,x_n|y)P(y)\)
The Naive Bayes assumption is that each \(x_i\) is conditionally independent of the others given the label \(y\). Therefore:
\(P(x_1,x_2,...,x_n|y)=P(x_1|y)P(x_2|y)...P(x_n|y)\)
Substituting this into the expression above gives:
\(P(y|x_1,x_2,...,x_n)\propto P(x_1|y)P(x_2|y)...P(x_n|y)P(y)\)
We classify by choosing the \(y\) which maximises this quantity.
This is much easier to estimate than the full joint \(P(x_1,x_2,...,x_n|y)\): each factor \(P(x_i|y)\) only needs counts for a single piece of evidence at a time, rather than samples of every combination of evidence.
This counting approach is used when the evidence is also divided into classes (discrete values); for continuous evidence the probability of any individual value is \(0\), so a probability density (e.g. a Gaussian) would be needed instead.
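As a minimal sketch of the decision rule (the labels, evidence names, and probability values below are made up for illustration), choosing \(y\) just means multiplying the relevant conditional probabilities by the prior and taking the largest result:

```python
# Minimal sketch of the Naive Bayes decision rule, using made-up
# probability tables for two labels and two binary pieces of evidence.

# P(y): hypothetical prior probabilities of each label
prior = {"spam": 0.4, "ham": 0.6}

# P(x_i | y): hypothetical conditional probabilities
conditional = {
    "spam": {"contains_offer": 0.7, "has_attachment": 0.3},
    "ham": {"contains_offer": 0.1, "has_attachment": 0.4},
}

def classify(evidence):
    """Return the label y maximising P(x_1|y)...P(x_n|y)P(y)."""
    best_label, best_score = None, -1.0
    for label, p_y in prior.items():
        score = p_y
        for feature in evidence:
            score *= conditional[label][feature]
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify(["contains_offer", "has_attachment"]))  # -> "spam"
```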
We can estimate \(P(y)\) easily, as the relative frequency of label \(y\) across the training sample.
Normally, \(P(x_1|y)=\dfrac{n_c}{n_y}\), where:
\(n_c\) is the number of instances where the evidence is \(c\) and the label is \(y\).
\(n_y\) is the number of instances where the label is \(y\).
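As a rough sketch of these counting estimates (the toy sample below is made up), both \(P(y)\) and \(P(x_1|y)\) come straight from frequencies:

```python
from collections import Counter

# Made-up toy sample: each row is (evidence value, label).
data = [("sunny", "play"), ("sunny", "play"), ("rain", "stay"),
        ("rain", "play"), ("sunny", "stay")]

n = len(data)
label_counts = Counter(label for _, label in data)   # n_y for each label
pair_counts = Counter(data)                          # n_c for each (evidence, label) pair

p_y = {y: count / n for y, count in label_counts.items()}   # P(y) by frequency

def p_x_given_y(x, y):
    """P(x|y) = n_c / n_y, estimated purely by counting."""
    return pair_counts[(x, y)] / label_counts[y]

print(p_y["play"])                    # 3/5 = 0.6
print(p_x_given_y("sunny", "play"))   # 2/3 ≈ 0.667
```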
To reduce the risk of any individual probability being zero (which would make the whole product zero), we can adjust the estimates so that:
\(P(x_1|y)=\dfrac{n_c+mp}{n_y+m}\), where:
\(p\) is a prior estimate of the probability. If this is unknown, assume a uniform prior and use \(\dfrac{1}{k}\), where \(k\) is the number of classes (possible values) the evidence can take.
\(m\) is a parameter called the equivalent sample size.
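For example, with hypothetical counts \(n_c=0\) and \(n_y=10\), a uniform prior \(p=\dfrac{1}{2}\) (two possible evidence values) and \(m=2\), the adjusted estimate is \(P(x_1|y)=\dfrac{0+2\times\frac{1}{2}}{10+2}=\dfrac{1}{12}\approx 0.083\), rather than \(0\).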
Naive Bayes can be used to classify text documents. The \(x\) variables can be the appearance (presence or absence) of each word, and \(y\) can be the document classification.
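Putting the pieces together, here is a self-contained sketch of such a text classifier. The toy corpus, word-presence features, and the values of \(m\) and \(p\) are all made up for illustration; the estimates use the counting and m-estimate formulas above.

```python
from collections import Counter, defaultdict

# Made-up toy corpus: documents and their classes.
docs = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

# The evidence x_i is whether each vocabulary word appears in the document.
vocab = sorted({w for text, _ in docs for w in text.split()})
labels = sorted({y for _, y in docs})

n_y = Counter(y for _, y in docs)            # n_y: documents with label y
n_c = defaultdict(Counter)                   # n_c[y][w]: documents of class y containing w
for text, y in docs:
    for w in set(text.split()):
        n_c[y][w] += 1

m, p = 2, 0.5   # equivalent sample size and uniform prior (word present / absent)

def p_word_given_y(word, y, present):
    """m-estimate of P(word present (or absent) | y): (n_c + m*p) / (n_y + m)."""
    count = n_c[y][word] if present else n_y[y] - n_c[y][word]
    return (count + m * p) / (n_y[y] + m)

def classify(text):
    """Choose y maximising P(y) * product over vocabulary words of P(x_i|y)."""
    words = set(text.split())
    best, best_score = None, -1.0
    for y in labels:
        score = n_y[y] / len(docs)           # P(y) by frequency
        for w in vocab:
            score *= p_word_given_y(w, y, w in words)
        if score > best_score:
            best, best_score = y, score
    return best

print(classify("buy cheap pills"))         # expected: "spam"
print(classify("agenda for the meeting"))  # expected: "ham"
```

In practice the product of many small probabilities can underflow, so implementations typically sum logarithms instead of multiplying probabilities; the argmax is unchanged.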