Dynamic programming is similar to divide-and-conquer algorithms in that both solve sub-problems.
However, in dynamic programming the sub-problems overlap.
A policy maps a state to an action:
\(a_t=\pi (s_t)\)
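As a minimal illustration, a deterministic policy can be represented as a plain lookup table from states to actions. The state and action names below are hypothetical.

```python
# A deterministic policy as a simple mapping (names are made up for illustration).
policy = {
    "low_battery": "recharge",
    "high_battery": "search",
}

def act(state):
    """Return the action a_t = pi(s_t) for the current state."""
    return policy[state]

print(act("low_battery"))  # -> "recharge"
```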
The policy does not need to change over time, because the discount factor is constant. That is, if a different action were better at some future time step, it would already be better now.
Fixing a policy induces a transition model over states, which we write as \(P_\pi\).
There exists an optimal policy that is at least as good as every other policy, from any starting state.
There is no closed-form solution for finding the optimal policy.
We use iterative methods instead.
We break down the value function into the immediate reward plus the discounted value of the next state.
This decomposition works because expectation is linear.
\(v_\pi (s)=R_{s,\pi(s)}+\gamma \sum_{s'}P_{s,\pi(s)}(s')v_\pi (s')\)
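As a sketch of how this recursion is used in practice, the snippet below repeatedly applies the Bellman expectation backup to a small made-up MDP until the values converge (iterative policy evaluation). The rewards, transition probabilities, and discount factor are purely illustrative.

```python
import numpy as np

# Toy 3-state MDP under a fixed policy pi (all numbers are illustrative).
# R_pi[s]     = immediate reward R_{s, pi(s)}
# P_pi[s, s'] = transition probability P_{s, pi(s)}(s')
R_pi = np.array([1.0, 0.0, -1.0])
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])
gamma = 0.9

# Iterative policy evaluation: apply the Bellman backup
# v <- r_pi + gamma * P_pi v until the values stop changing.
v = np.zeros(3)
for _ in range(1000):
    v_new = R_pi + gamma * P_pi @ v
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

print(v)  # approximate v_pi for each state
```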
We can write this in matrix form.
\(v_\pi = r_\pi + \gamma P_\pi v_\pi\)
We can then solve this:
\(v_\pi = (I-\gamma P_\pi)^{-1} r_\pi\)
The resulting vector gives \(v_\pi(s)\) for every possible starting state \(s\).
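For a small state space this can be computed directly. The sketch below solves the linear system \((I-\gamma P_\pi)v_\pi = r_\pi\) for the same illustrative MDP as above, rather than forming the inverse explicitly.

```python
import numpy as np

# Same illustrative R_pi, P_pi, and gamma as in the earlier sketch.
R_pi = np.array([1.0, 0.0, -1.0])
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])
gamma = 0.9

# Solve (I - gamma * P_pi) v = r_pi; this matches the closed form above.
v = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(v)  # value of each starting state under pi
```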