We can define a tree as a set of nodes: \(T\).
For each node we define a splitting variable \(k\) and a splitting threshold \(r\).
Our prior is \(P(T, k, r)\).
We factor this prior as:
\(P(T, k, r)=P(T)P(k, r|T)\)
\(P(T, k, r)=P(T)P(k|T)P(r|T, k)\)
So we want to estimate:
\(P(T)\) - the tree structure (e.g. its number of nodes).
\(P(k|T)\) - which variables we split on, given the tree.
\(P(r|T, k)\) - the cutoffs, given the tree and the splitting variables.
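These pieces can be sampled directly. A minimal sketch, assuming (hypothetically) a uniform prior over variables for \(P(k|T)\) and, given \(k\), a uniform prior over the observed values of that variable for \(P(r|T, k)\):

```python
import random

def sample_split(X):
    """Sample a splitting variable k and threshold r for one node.

    Illustrative prior choice: uniform over the d variables, and
    uniform over the observed values of the chosen variable.
    X is a list of rows (each a list of d feature values).
    """
    d = len(X[0])
    k = random.randrange(d)                    # P(k|T): uniform over variables
    r = random.choice([row[k] for row in X])   # P(r|T, k): uniform over observed values
    return k, r

random.seed(0)
X = [[0.1, 5.0], [0.4, 2.0], [0.9, 7.0]]
k, r = sample_split(X)
```

Restricting thresholds to observed values keeps every candidate split non-trivial; a continuous prior on \(r\) would also work.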
If at the leaf we have a parametric model, our prior is instead:
\(P(T, k, r, \theta )=P(T)P(k|T)P(r|T, k)P(\theta | T, k, r)\)
We then need to additionally estimate \(P(\theta | T, k, r)\).
We can generate a tree with a fixed number of leaves, according to our prior.
When we start the tree, we associate the root node with a count equal to the total number of leaves.
When we split a node, its leaf count is divided between the two children. A node whose count reaches one becomes a leaf, and we do no further splitting there.
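The leaf-count scheme above can be sketched as follows; splitting the count uniformly between the two children is one illustrative choice, not the only possible one:

```python
import random

def generate_tree(n_leaves):
    """Generate a tree skeleton with exactly n_leaves leaves.

    Each node carries the count of leaves it must eventually produce.
    A node with count 1 becomes a leaf; otherwise the count is divided
    (here, uniformly at random) between the two children. Splitting
    variables and thresholds would be drawn separately from P(k|T)
    and P(r|T, k).
    """
    if n_leaves == 1:
        return "leaf"
    left = random.randint(1, n_leaves - 1)  # divide the leaf count
    return (generate_tree(left), generate_tree(n_leaves - left))

def count_leaves(tree):
    if tree == "leaf":
        return 1
    return count_leaves(tree[0]) + count_leaves(tree[1])

random.seed(0)
t = generate_tree(5)
```

By construction, `count_leaves(t)` always equals the requested number of leaves.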
Call the collective tree parameters \(\Theta = (T, k, r)\); the leaf parameters remain \(\theta\).
Collectively, our prior is defined by \(P(\Theta)\) and \(P(\theta)\).
We want to know the posterior given our data \(X\).
\(P(\Theta | X)=\dfrac{P(X|\Theta )P(\Theta )}{P(X)}\)
\(P(\Theta | X)\propto P(X|\Theta )P(\Theta )\)
We now explore \(P(X|\Theta)\):
\(P(X|\Theta )=\int P(X| \theta , \Theta)P(\theta)d\theta\)
This means our posterior is:
\(P(\Theta | X)\propto P(\Theta )\int P(X| \theta , \Theta)P(\theta)d\theta\)
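When the leaf model is conjugate, the integral over \(\theta\) has a closed form. A hedged example, assuming Bernoulli observations at each leaf with a Beta(\(\alpha, \beta\)) prior on \(\theta\) (the hyperparameters below are illustrative):

```python
from math import lgamma, exp

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(successes, failures, alpha=1.0, beta=1.0):
    """log P(X|Theta) for one leaf: Bernoulli likelihood with a
    Beta(alpha, beta) prior on theta, integrated analytically
    (the Beta-Binomial marginal B(alpha+s, beta+f) / B(alpha, beta))."""
    return log_beta(alpha + successes, beta + failures) - log_beta(alpha, beta)

# For a leaf holding observations [1, 1, 0] (2 successes, 1 failure):
lml = log_marginal_likelihood(2, 1)  # exp(lml) == 1/12 for alpha = beta = 1
```

The full \(P(X|\Theta)\) is then the product of these per-leaf marginals over the leaves of the tree.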
This can be estimated with MCMC.
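A generic Metropolis-Hastings sketch of that estimation. The symmetric proposal and the toy Gaussian target below are illustrative stand-ins; a real tree sampler would propose structured moves such as grow/prune on \(\Theta\), with the same accept/reject step:

```python
import math
import random

def metropolis_hastings(log_post, propose, theta0, n_steps=1000):
    """Metropolis-Hastings with a symmetric proposal.

    log_post: unnormalized log posterior, log P(X|Theta) + log P(Theta)
    propose:  symmetric proposal Theta -> Theta'
    """
    theta, lp = theta0, log_post(theta0)
    samples = []
    for _ in range(n_steps):
        cand = propose(theta)
        lp_cand = log_post(cand)
        # Accept with probability min(1, posterior ratio).
        if math.log(random.random()) < lp_cand - lp:
            theta, lp = cand, lp_cand
        samples.append(theta)
    return samples

# Toy usage: sample a scalar from a standard normal "posterior".
random.seed(1)
draws = metropolis_hastings(
    log_post=lambda x: -0.5 * x * x,
    propose=lambda x: x + random.uniform(-1, 1),
    theta0=0.0,
    n_steps=5000,
)
mean = sum(draws) / len(draws)
```

Because only the ratio of posteriors enters the acceptance step, the intractable normalizer \(P(X)\) never needs to be computed.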