Consider Bayes’ theorem
\(P(y|x_1,x_2,...,x_n)=\dfrac{P(x_1,x_2,...,x_n|y)P(y)}{P(x_1,x_2,...,x_n)}\)
Here, \(y\) is the label and \(x_1,x_2,...,x_n\) are the pieces of evidence (the features). We want the probability of each label given the evidence.
The denominator, \(P(x_1,x_2,...,x_n)\), is the same for every label, so we only need the numerator:
\(P(y|x_1,x_2,...,x_n)\propto P(x_1,x_2,...,x_n|y)P(y)\)
The Naive Bayes assumption is that each \(x_i\) is conditionally independent of the others given the label \(y\). Therefore:
\(P(x_1,x_2,...,x_n|y)=P(x_1|y)P(x_2|y)...P(x_n|y)\)
Substituting this into the expression above gives:
\(P(y|x_1,x_2,...,x_n)\propto P(x_1|y)P(x_2|y)...P(x_n|y)P(y)\)
We classify by choosing the \(y\) which maximises this quantity.
This is much easier to estimate than the full joint \(P(x_1,x_2,...,x_n|y)\): each factor \(P(x_i|y)\) only needs counts for a single piece of evidence at a time, rather than samples of every combination of evidence.
This counting approach is used when the evidence is also divided into classes (discrete values); for continuous evidence the probability of any individual value is \(0\), so a probability density (e.g. a Gaussian) would be needed instead.
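As a minimal sketch of the decision rule (the labels, evidence names, and probability values below are made up for illustration), choosing \(y\) just means multiplying the relevant conditional probabilities by the prior and taking the largest result:

```python
# Minimal sketch of the Naive Bayes decision rule, using made-up
# probability tables for two labels and two binary pieces of evidence.

# P(y): hypothetical prior probabilities of each label
prior = {"spam": 0.4, "ham": 0.6}

# P(x_i | y): hypothetical conditional probabilities
conditional = {
    "spam": {"contains_offer": 0.7, "has_attachment": 0.3},
    "ham": {"contains_offer": 0.1, "has_attachment": 0.4},
}

def classify(evidence):
    """Return the label y maximising P(x_1|y)...P(x_n|y)P(y)."""
    best_label, best_score = None, -1.0
    for label, p_y in prior.items():
        score = p_y
        for feature in evidence:
            score *= conditional[label][feature]
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify(["contains_offer", "has_attachment"]))  # -> "spam"
```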
We can estimate \(P(y)\) easily, as the relative frequency of label \(y\) across the training sample.
Normally, \(P(x_1|y)=\dfrac{n_c}{n_y}\), where:
\(n_c\) is the number of instances where the evidence is \(c\) and the label is \(y\).
\(n_y\) is the number of instances where the label is \(y\).
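As a rough sketch of these counting estimates (the toy sample below is made up), both \(P(y)\) and \(P(x_1|y)\) come straight from frequencies:

```python
from collections import Counter

# Made-up toy sample: each row is (evidence value, label).
data = [("sunny", "play"), ("sunny", "play"), ("rain", "stay"),
        ("rain", "play"), ("sunny", "stay")]

n = len(data)
label_counts = Counter(label for _, label in data)   # n_y for each label
pair_counts = Counter(data)                          # n_c for each (evidence, label) pair

p_y = {y: count / n for y, count in label_counts.items()}   # P(y) by frequency

def p_x_given_y(x, y):
    """P(x|y) = n_c / n_y, estimated purely by counting."""
    return pair_counts[(x, y)] / label_counts[y]

print(p_y["play"])                    # 3/5 = 0.6
print(p_x_given_y("sunny", "play"))   # 2/3 ≈ 0.667
```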
To reduce the risk of any individual probability being zero (which would make the whole product zero), we can adjust the estimates so that:
\(P(x_1|y)=\dfrac{n_c+mp}{n_y+m}\), where:
\(p\) is a prior estimate of the probability. If this is unknown, assume a uniform prior and use \(\dfrac{1}{k}\), where \(k\) is the number of classes (possible values) the evidence can take.
\(m\) is a parameter called the equivalent sample size.
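For example, with hypothetical counts \(n_c=0\) and \(n_y=10\), a uniform prior \(p=\dfrac{1}{2}\) (two possible evidence values) and \(m=2\), the adjusted estimate is \(P(x_1|y)=\dfrac{0+2\times\frac{1}{2}}{10+2}=\dfrac{1}{12}\approx 0.083\), rather than \(0\).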
Naive Bayes can be used to classify text documents. The \(x\) variables can be the appearance (presence or absence) of each word, and \(y\) can be the document classification.
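Putting the pieces together, here is a self-contained sketch of such a text classifier. The toy corpus, word-presence features, and the values of \(m\) and \(p\) are all made up for illustration; the estimates use the counting and m-estimate formulas above.

```python
from collections import Counter, defaultdict

# Made-up toy corpus: documents and their classes.
docs = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

# The evidence x_i is whether each vocabulary word appears in the document.
vocab = sorted({w for text, _ in docs for w in text.split()})
labels = sorted({y for _, y in docs})

n_y = Counter(y for _, y in docs)            # n_y: documents with label y
n_c = defaultdict(Counter)                   # n_c[y][w]: documents of class y containing w
for text, y in docs:
    for w in set(text.split()):
        n_c[y][w] += 1

m, p = 2, 0.5   # equivalent sample size and uniform prior (word present / absent)

def p_word_given_y(word, y, present):
    """m-estimate of P(word present (or absent) | y): (n_c + m*p) / (n_y + m)."""
    count = n_c[y][word] if present else n_y[y] - n_c[y][word]
    return (count + m * p) / (n_y[y] + m)

def classify(text):
    """Choose y maximising P(y) * product over vocabulary words of P(x_i|y)."""
    words = set(text.split())
    best, best_score = None, -1.0
    for y in labels:
        score = n_y[y] / len(docs)           # P(y) by frequency
        for w in vocab:
            score *= p_word_given_y(w, y, w in words)
        if score > best_score:
            best, best_score = y, score
    return best

print(classify("buy cheap pills"))         # expected: "spam"
print(classify("agenda for the meeting"))  # expected: "ham"
```

In practice the product of many small probabilities can underflow, so implementations typically sum logarithms instead of multiplying probabilities; the argmax is unchanged.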