Bayesian inference gives us the full distribution \(p(w)\), not just the moments of a single point estimate.
The cross-entropy \(H(P,Q)=E_P[I_Q]\) is the expected information content (surprisal) of \(Q\) under \(P\). For a discrete distribution this is:
\(H(P,Q)=-\sum_x P(x)\log Q(x)\)
Here \(Q\) is the prior and \(P\) is the posterior.
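A minimal sketch of the discrete cross-entropy formula in Python; the uniform prior and the posterior values below are made up for illustration:

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x) for discrete distributions.

    Terms with P(x) = 0 contribute nothing, so they are skipped.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example: a uniform prior Q over four outcomes,
# and a posterior P that has concentrated on the first outcome.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.7, 0.1, 0.1, 0.1]

# With a uniform Q, every log Q(x) = -log 4, so H(P, Q) = log 4 regardless of P.
print(cross_entropy(posterior, prior))  # → 1.3862943611198906 (= log 4)
```

Note that against a uniform prior the cross-entropy is constant: it measures the average code length for \(P\)'s outcomes using a code built for \(Q\).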
When we update from the prior to the posterior, the entropy of the distribution changes, and the KL divergence measures the information gained in that update:
\(D_{KL}(P||Q)=H(P,Q)-H(P)\)
KL divergence is also called the information gain.
\(D_{KL}(P||Q)\ge 0\), with equality iff \(P=Q\) (Gibbs' inequality).
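The identity \(D_{KL}(P||Q)=H(P,Q)-H(P)\) and the nonnegativity of the information gain can be checked numerically; the prior/posterior pair below is a made-up example:

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical prior Q and posterior P over four outcomes.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.7, 0.1, 0.1, 0.1]

gain = kl_divergence(posterior, prior)

# D_KL(P || Q) = H(P, Q) - H(P), and it is never negative.
assert abs(gain - (cross_entropy(posterior, prior) - entropy(posterior))) < 1e-12
assert gain >= 0
print(gain)
```

The information gain here is positive because the posterior has moved away from the uniform prior; it would be exactly zero only if the data left the distribution unchanged.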