We can train a decision tree by starting with the simplest possible tree - all training examples in a single node.
We can then do a greedy search to identify which split of that node is best.
We then repeat this process recursively on the resulting child nodes.
We split nodes to maximise the information gain, i.e. the reduction in entropy.
Entropy is:
\(E = -\sum_{i=1}^{n} p_i \log_2 p_i\)
where the sum runs over the \(n\) classes and \(p_i\) is the proportion of examples in class \(i\).
The information gain of a split is the parent node's entropy minus the size-weighted entropy of each branch: \(\text{Gain} = E_{\text{parent}} - \sum_b \frac{n_b}{n} E_b\).
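As a minimal sketch of these quantities (NumPy-based; the function names and toy data are my own, not from the notes), we can compute entropy, information gain, and a greedy best-threshold split on a single feature:

```python
import numpy as np

def entropy(labels):
    """E = -sum_i p_i log2 p_i over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, branch_labels):
    """Parent entropy minus the size-weighted entropy of each branch."""
    n = len(parent_labels)
    weighted = sum(len(b) / n * entropy(b) for b in branch_labels)
    return entropy(parent_labels) - weighted

def best_split(feature, labels):
    """Greedy search: try every threshold on a single feature and keep the
    split with the highest information gain."""
    best_gain, best_threshold = -np.inf, None
    for t in np.unique(feature):
        left, right = labels[feature <= t], labels[feature > t]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = information_gain(labels, [left, right])
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return best_threshold, best_gain

# Example: one numeric feature with perfectly separable labels.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # threshold 3.0, gain 1.0 (entropy drops from 1 to 0)
```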
Training a decision tree until each leaf contains only one training example will result in overfitting.
We can use pruning to regularise trees.
From the bottom up, replace each node with a leaf predicting its most popular class. Accept the change if there is no reduction in accuracy on held-out data.
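A hedged sketch of this bottom-up pruning on a toy tree represented as nested dicts (the node format and helper names are assumptions made for illustration, not from the notes):

```python
import numpy as np

# Assumed node format:
#   leaf:     {"leaf": True, "class": c}
#   internal: {"leaf": False, "feature": j, "threshold": t,
#              "left": subtree, "right": subtree, "majority": c}

def predict(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["class"]

def accuracy(tree, X, y):
    return np.mean([predict(tree, xi) == yi for xi, yi in zip(X, y)])

def prune(node, root, X_val, y_val):
    """Bottom-up: try replacing each internal node with a leaf predicting its
    majority class; keep the change only if validation accuracy does not drop."""
    if node["leaf"]:
        return
    prune(node["left"], root, X_val, y_val)
    prune(node["right"], root, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    saved = dict(node)                         # remember the subtree
    node.clear()
    node.update({"leaf": True, "class": saved["majority"]})
    if accuracy(root, X_val, y_val) < before:
        node.clear()                           # pruning hurt accuracy: undo it
        node.update(saved)
```

Calling `prune(tree, tree, X_val, y_val)` prunes in place, using the held-out set to decide which replacements to accept.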
Alternatively, cost-complexity pruning: take the full tree \(T_0\).
Iteratively find the subtree whose replacement by a leaf gives the best value of the cost function, which trades off accuracy against the number of leaves.
Collapse that subtree to a leaf, generating \(T_{i+1}\).
When only the root remains, choose a single tree from the sequence \(T_0, T_1, \dots\) using CV.
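A hedged concrete sketch, assuming scikit-learn's cost-complexity pruning (its `ccp_alpha` penalises the number of leaves, matching the cost function above) and a placeholder dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset

# Grow the full tree T_0 and get the sequence of effective alphas; each alpha
# corresponds to collapsing one more subtree (T_1, T_2, ..., down to the root).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Choose a single tree from the sequence by cross-validating over alpha.
scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean()
    for alpha in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```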
Generally we would split the data up: grow the tree with one set and then prune with the other.
We can also iterate, alternating between growing and pruning on the two sets.
When pruning, for each pair of sibling leaves we test whether they should be merged back into their parent.
If our two sets are \(A\) and \(B\) we can do:
\(A\): Grow
\(B\): Prune
\(B\): Grow
\(A\): Prune
And repeat this process.
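A rough, runnable sketch of one round of this alternation, assuming scikit-learn and reusing its pruning path; scikit-learn trees cannot be grown incrementally, so each "grow" step here refits from scratch, which only approximates the procedure above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset
X_A, X_B, y_A, y_B = train_test_split(X, y, test_size=0.5, random_state=0)

def grow_then_prune(X_grow, y_grow, X_prune, y_prune):
    """Grow a full tree on one half, then keep the pruning level that scores
    best on the other half."""
    alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_grow, y_grow
    ).ccp_alphas
    candidates = [
        DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_grow, y_grow)
        for a in alphas
    ]
    return max(candidates, key=lambda t: t.score(X_prune, y_prune))

tree = grow_then_prune(X_A, y_A, X_B, y_B)   # A: grow, B: prune
tree = grow_then_prune(X_B, y_B, X_A, y_A)   # B: grow, A: prune
```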
Once we have built the candidate trees, we keep a single one and discard the rest.
Unbalanced dataset: many more examples of one class than the others.
We can undersample the majority class, or synthetically generate minority-class examples.
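A quick NumPy sketch of both options on toy data (the interpolation scheme below is only a crude SMOTE-like stand-in, not the real SMOTE algorithm; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced data: 100 majority examples (class 0), 10 minority (class 1).
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 100 + [1] * 10)

# Option 1: undersample the majority class down to the minority count.
maj, mino = np.where(y == 0)[0], np.where(y == 1)[0]
keep = rng.choice(maj, size=len(mino), replace=False)
X_under = np.vstack([X[keep], X[mino]])
y_under = np.concatenate([y[keep], y[mino]])

# Option 2: synthesise new minority points by interpolating between random
# pairs of existing minority points.
n_new = len(maj) - len(mino)
a = X[rng.choice(mino, n_new)]
b = X[rng.choice(mino, n_new)]
X_synth = a + rng.uniform(0, 1, (n_new, 1)) * (b - a)
X_over = np.vstack([X, X_synth])
y_over = np.concatenate([y, np.ones(n_new, dtype=int)])
```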