In a Markov chain we move from state to state randomly, following a transition matrix.
We have:
\(S\) - the state space
\(s_1\) - the initial state
\(P\) - the transition model
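As a minimal sketch, a Markov chain can be simulated by repeatedly sampling the next state from the row of \(P\) for the current state. The two-state chain, its probabilities, and the `simulate` helper below are made up for illustration.

```python
import numpy as np

# Illustrative two-state chain: S = {0, 1}, with initial state s_1 = 0.
P = np.array([[0.9, 0.1],   # row s gives P(next state | current state s)
              [0.5, 0.5]])

def simulate(P, s1, steps, rng=None):
    """Sample a trajectory by repeatedly drawing the next state from row P[s]."""
    if rng is None:
        rng = np.random.default_rng(0)
    states = [s1]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(simulate(P, s1=0, steps=10))
```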
Now suppose the agent gets a payoff (reward) that depends on the current state. We now have:
\(S\) - the state space
\(s_1\) - the initial state
\(P\) - the transition model
\(R\) - the reward distribution
The objective is to maximise the expected discounted sum of rewards, with discount factor \(\gamma \in [0, 1)\):
\(E[\sum_{t=1}^\infty \gamma^{t-1}r_t|s_1]\)
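A minimal sketch of estimating this expectation by Monte Carlo rollouts; the chain, the state-dependent reward vector \(R\), the discount \(\gamma = 0.9\), and the truncation horizon are all illustrative assumptions, not values from the notes.

```python
import numpy as np

# Illustrative two-state chain with a state-dependent reward.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([1.0, 0.0])        # reward for being in each state (made up)
gamma = 0.9

def discounted_return(P, R, s1, gamma, horizon=200, rng=None):
    """One sampled value of sum_{t=1..horizon} gamma^(t-1) * r_t."""
    if rng is None:
        rng = np.random.default_rng()
    s, total = s1, 0.0
    for t in range(horizon):    # loop index t = 0 corresponds to gamma^(t-1) with t starting at 1
        total += gamma**t * R[s]
        s = rng.choice(len(P), p=P[s])
    return total

# Average many rollouts to approximate E[ sum_t gamma^(t-1) r_t | s_1 ].
samples = [discounted_return(P, R, s1=0, gamma=gamma) for _ in range(1000)]
print(np.mean(samples))
```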
In an MDP the agent can choose an action in each state. The action affects the transition probabilities.
This means that the rewards depend on the actions taken. We now have:
\(S\) - the state space
\(s_1\) - the initial state
\(A\) - the action space
\(P\) - the transition model
\(R\) - the reward distribution
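One way these components might be stored in code; the array shapes, the `MDP` dataclass, and the tiny two-state, two-action example are assumptions made for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    P: np.ndarray        # transition model, shape (|S|, |A|, |S|): P[s, a, s'] = P(s' | s, a)
    R: np.ndarray        # expected reward, shape (|S|, |A|): R[s, a]
    gamma: float         # discount factor
    s1: int = 0          # initial state

# Tiny illustrative MDP with two states and two actions.
mdp = MDP(
    P=np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.0, 1.0]]]),
    R=np.array([[1.0, 0.0],
                [0.0, 2.0]]),
    gamma=0.9,
)
```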
The decision maker needs to decide which action to take. For an MDP we assume that the agent knows:
The current state
The transition matrix
The payoffs
If the current state is not known, we have a Partially Observable Markov Decision Process.
If the transition matrix or payoffs are not known we have a reinforcement learning problem.
The Markov property means that the next state and the reward depend only on the current state and action, not on the full history \(H\):
\(P(s_{t+1}|H)=P(s_{t+1}|s_t=s, a_t=a)=P_{s,a}\)
\(E[r_t|H]=E[r_t|s_t=s, a_t=a]=R_{s,a}\)
An optimal policy is one whose expected return is greater than or equal to that of any other policy.
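Written in terms of value functions: a policy \(\pi^*\) is optimal if \(v_{\pi^*}(s) \ge v_\pi(s)\) for every state \(s\) and every policy \(\pi\).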
We start with a random policy.
We then loop:
Evaluate the policy, using the Bellman equations.
Update the policy
We update the policy by changing the action \(a\) chosen in each state \(s\) to maximise the one-step lookahead value:
\(R_{s,a} + \gamma \sum_{s'} P(s'|s,a)\, v_\pi(s')\)
For example, if the current policy takes action \(a\) in state \(s\), we check whether the value increases when we switch from \(a\) to \(a'\).
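A minimal sketch of this loop, using the same array conventions as above; evaluating the policy by solving the Bellman equations exactly (a single matrix solve) and improving it greedily are choices made here for illustration, not prescribed by the notes.

```python
import numpy as np

def evaluate(P, R, gamma, policy):
    """Policy evaluation: solve v = r_pi + gamma * P_pi v exactly (one matrix solve)."""
    n = len(R)
    P_pi = P[np.arange(n), policy]            # (|S|, |S|): transitions under the policy
    r_pi = R[np.arange(n), policy]            # (|S|,):  rewards under the policy
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def improve(P, R, gamma, v):
    """Policy improvement: pick, in each state, the action with the best one-step lookahead."""
    q = R + gamma * P @ v                     # (|S|, |A|): R[s,a] + gamma * sum_s' P[s,a,s'] v[s']
    return q.argmax(axis=1)

def policy_iteration(P, R, gamma):
    policy = np.zeros(len(R), dtype=int)      # arbitrary starting policy (the notes start from a random one)
    while True:
        v = evaluate(P, R, gamma, policy)
        new_policy = improve(P, R, gamma, v)
        if np.array_equal(new_policy, policy):  # stop once the greedy update changes nothing
            return policy, v
        policy = new_policy

# Example with a tiny two-state, two-action MDP (same shapes as above).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R, gamma=0.9))
```

The loop stops when the greedy improvement no longer changes the policy, at which point no single-state change of action can increase the value.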
Doesn’t directly calculate a policy. Doesn’t require inverting a matrix.