We can define a tree as a set of nodes: \(T\).
For each node we define a splitting variable \(k\) and a splitting threshold \(r\).
Our prior is \(P(T, k, r)\).
We factor this prior as:
\(P(T, k, r)=P(T)P(k, r|T)\)
\(P(T, k, r)=P(T)P(k|T)P(r|T, k)\)
So we want to estimate:
\(P(T)\) - the tree structure (e.g. its number of nodes).
\(P(k|T)\) - which variables we split on, given the tree.
\(P(r|T, k)\) - the cutoffs, given the tree and the splitting variables.
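These pieces can be sampled directly. A minimal sketch, assuming (hypothetically) a uniform prior over variables for \(P(k|T)\) and, given \(k\), a uniform prior over the observed values of that variable for \(P(r|T, k)\):

```python
import random

def sample_split(X):
    """Sample a splitting variable k and threshold r for one node.

    Illustrative prior choice: uniform over the d variables, and
    uniform over the observed values of the chosen variable.
    X is a list of rows (each a list of d feature values).
    """
    d = len(X[0])
    k = random.randrange(d)                    # P(k|T): uniform over variables
    r = random.choice([row[k] for row in X])   # P(r|T, k): uniform over observed values
    return k, r

random.seed(0)
X = [[0.1, 5.0], [0.4, 2.0], [0.9, 7.0]]
k, r = sample_split(X)
```

Restricting thresholds to observed values keeps every candidate split non-trivial; a continuous prior on \(r\) would also work.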
If at the leaf we have a parametric model, our prior is instead:
\(P(T, k, r, \theta )=P(T)P(k|T)P(r|T, k)P(\theta | T, k, r)\)
We then need to additionally estimate \(P(\theta | T, k, r)\).
We can generate a tree with a fixed number of leaves, according to our prior.
When we start the tree, we associate the root node with a count equal to the total number of leaves.
When we split a node, its leaf count is divided between the two children. A node whose count reaches one becomes a leaf, and we do no further splitting there.
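The leaf-count scheme above can be sketched as follows; splitting the count uniformly between the two children is one illustrative choice, not the only possible one:

```python
import random

def generate_tree(n_leaves):
    """Generate a tree skeleton with exactly n_leaves leaves.

    Each node carries the count of leaves it must eventually produce.
    A node with count 1 becomes a leaf; otherwise the count is divided
    (here, uniformly at random) between the two children. Splitting
    variables and thresholds would be drawn separately from P(k|T)
    and P(r|T, k).
    """
    if n_leaves == 1:
        return "leaf"
    left = random.randint(1, n_leaves - 1)  # divide the leaf count
    return (generate_tree(left), generate_tree(n_leaves - left))

def count_leaves(tree):
    if tree == "leaf":
        return 1
    return count_leaves(tree[0]) + count_leaves(tree[1])

random.seed(0)
t = generate_tree(5)
```

By construction, `count_leaves(t)` always equals the requested number of leaves.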
Call the collective tree parameters \(\Theta = (T, k, r)\); the leaf parameters remain \(\theta\).
Collectively, our prior is defined by \(P(\Theta)\) and \(P(\theta)\).
We want to know the posterior given our data \(X\).
\(P(\Theta | X)=\dfrac{P(X|\Theta )P(\Theta )}{P(X)}\)
\(P(\Theta | X)\propto P(X|\Theta )P(\Theta )\)
We now explore \(P(X|\Theta)\):
\(P(X|\Theta )=\int P(X| \theta , \Theta)P(\theta)d\theta\)
This means our posterior is:
\(P(\Theta | X)\propto P(\Theta )\int P(X| \theta , \Theta)P(\theta)d\theta\)
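When the leaf model is conjugate, the integral over \(\theta\) has a closed form. A hedged example, assuming Bernoulli observations at each leaf with a Beta(\(\alpha, \beta\)) prior on \(\theta\) (the hyperparameters below are illustrative):

```python
from math import lgamma, exp

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(successes, failures, alpha=1.0, beta=1.0):
    """log P(X|Theta) for one leaf: Bernoulli likelihood with a
    Beta(alpha, beta) prior on theta, integrated analytically
    (the Beta-Binomial marginal B(alpha+s, beta+f) / B(alpha, beta))."""
    return log_beta(alpha + successes, beta + failures) - log_beta(alpha, beta)

# For a leaf holding observations [1, 1, 0] (2 successes, 1 failure):
lml = log_marginal_likelihood(2, 1)  # exp(lml) == 1/12 for alpha = beta = 1
```

The full \(P(X|\Theta)\) is then the product of these per-leaf marginals over the leaves of the tree.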
This can be estimated with MCMC.
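A generic Metropolis-Hastings sketch of that estimation. The symmetric proposal and the toy Gaussian target below are illustrative stand-ins; a real tree sampler would propose structured moves such as grow/prune on \(\Theta\), with the same accept/reject step:

```python
import math
import random

def metropolis_hastings(log_post, propose, theta0, n_steps=1000):
    """Metropolis-Hastings with a symmetric proposal.

    log_post: unnormalized log posterior, log P(X|Theta) + log P(Theta)
    propose:  symmetric proposal Theta -> Theta'
    """
    theta, lp = theta0, log_post(theta0)
    samples = []
    for _ in range(n_steps):
        cand = propose(theta)
        lp_cand = log_post(cand)
        # Accept with probability min(1, posterior ratio).
        if math.log(random.random()) < lp_cand - lp:
            theta, lp = cand, lp_cand
        samples.append(theta)
    return samples

# Toy usage: sample a scalar from a standard normal "posterior".
random.seed(1)
draws = metropolis_hastings(
    log_post=lambda x: -0.5 * x * x,
    propose=lambda x: x + random.uniform(-1, 1),
    theta0=0.0,
    n_steps=5000,
)
mean = sum(draws) / len(draws)
```

Because only the ratio of posteriors enters the acceptance step, the intractable normalizer \(P(X)\) never needs to be computed.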