RL Notes (short)
Things to remember:
- The discounted return is defined over future rewards (a forward view), even though, pragmatically, it is often implemented backward in time as an eligibility trace. This works because of the Markov property.
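For reference, the standard forward-view definition of the discounted return that these notes refer to:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$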
- $\gamma = 0$ implies a myopic agent – one that only cares about immediate reward.
- $\gamma = 1$ implies a farsighted agent – one that cares about optimizing future reward.
- $0 < \gamma < 1$ implies an agent that cares about the future to varying degrees. Cart-pole is an environment where we can model the reward in two different ways, and the two schemes require different ranges of $\gamma$.
  - In one scheme the reward is +1 for every timestep the pole stays up. In this case any discount rate $\gamma > 0$ lets reward accumulate as the agent steps through time, so longer episodes earn higher returns.
  - Alternatively, the reward can be 0 at every step and -1 on failure. In this case the agent must use a discount rate with $0 < \gamma < 1$. With $\gamma = 0$ the agent fails to learn anything (every immediate reward before failure is 0, and the final -1 never propagates back to earlier decisions), and with $\gamma = 1$ the agent treats all episode lengths the same, since the return is -1 no matter how long the pole stays up. With $0 < \gamma < 1$, the penalty $(-1)\cdot \gamma^t$ shrinks toward zero as $t$ grows, so the agent is pushed to delay failure as long as possible (both schemes are worked through in the sketch below).
- In general we choose a discount rate closer to 1 than to 0, such as 0.9; otherwise the agent will be shortsighted.
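A minimal sketch of the two cart-pole reward schemes above; the `discounted_return` helper, episode lengths, and reward values are invented for illustration, not taken from any particular library:

```python
def discounted_return(rewards, gamma):
    """G_0 = sum_t gamma^t * r_t: the discounted return from the first step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Scheme 1: +1 for every step the pole stays up -- longer episodes accumulate more return.
print(discounted_return([1.0] * 10, 0.9))    # ~6.51
print(discounted_return([1.0] * 100, 0.9))   # ~10.0 (larger, so balancing longer is better)

# Scheme 2: 0 every step, -1 only on failure -- needs 0 < gamma < 1.
short_fail = [0.0] * 9 + [-1.0]
long_fail = [0.0] * 99 + [-1.0]
print(discounted_return(short_fail, 0.9))    # -0.9^9  ~ -0.387
print(discounted_return(long_fail, 0.9))     # -0.9^99 ~ -0.00003 (closer to 0, so delaying failure is better)
print(discounted_return(short_fail, 1.0))    # -1.0: with gamma = 1 every episode length looks the same
print(discounted_return(short_fail, 0.0))    # 0.0: with gamma = 0 the return is just the immediate reward
```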
- The simplest policy is a deterministic one. This is usually the first set of rules/heuristics you would find in a naive business solution. Upgrading these to stochastic policies can be as simple as formulating an MDP or as complicated as function approximation.
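A rough illustration of that jump from a rule table to a stochastic policy; the states, actions, and probabilities here are made up:

```python
import random

# A deterministic policy is a lookup table from state to action, like a hand-written rule set.
deterministic_policy = {
    "pole_leaning_left": "push_left",
    "pole_leaning_right": "push_right",
}

# A stochastic policy maps each state to a distribution over actions.
stochastic_policy = {
    "pole_leaning_left": {"push_left": 0.9, "push_right": 0.1},
    "pole_leaning_right": {"push_left": 0.1, "push_right": 0.9},
}

def act(policy, state):
    rule = policy[state]
    if isinstance(rule, str):                # deterministic: always the same action
        return rule
    actions, probs = zip(*rule.items())      # stochastic: sample from the distribution
    return random.choices(actions, weights=probs)[0]

print(act(deterministic_policy, "pole_leaning_left"))  # always "push_left"
print(act(stochastic_policy, "pole_leaning_left"))     # usually "push_left"
```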
- Definition: $\pi' \geq \pi$ if and only if $v_{\pi'}(s) \geq v_\pi(s)\ \forall s \in S$. It follows that an optimal policy $\pi_*$ is one for which $\pi_* \geq \pi\ \forall \pi$.
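A toy check of this definition; the state values and the `policy_at_least_as_good` helper are invented for illustration:

```python
def policy_at_least_as_good(v_pi_prime, v_pi):
    """True iff v_pi_prime(s) >= v_pi(s) for every state s."""
    return all(v_pi_prime[s] >= v_pi[s] for s in v_pi)

v_pi       = {"s0": 1.0, "s1": 2.0, "s2": 0.5}
v_pi_prime = {"s0": 1.5, "s1": 2.0, "s2": 0.9}   # at least as good in every state
v_pi_other = {"s0": 3.0, "s1": 1.0, "s2": 0.9}   # better in s0 but worse in s1

print(policy_at_least_as_good(v_pi_prime, v_pi))  # True  -> pi' >= pi
print(policy_at_least_as_good(v_pi_other, v_pi))  # False -> not comparable under this partial order
```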
- When you talk about the action-value function of a deterministic policy, you are asking about the value of the single route the policy prescribes from each state. This is an example of how you can compare state-value and action-value functions directly, since $v_\pi(s) = q_\pi(s, \pi(s))$, but you can only do this for a deterministic policy; a stochastic policy instead averages $q_\pi(s, a)$ over its action distribution.
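A small sketch of that comparison, with made-up action values for a single state:

```python
# Invented action values for one state "s"; nothing here comes from a real environment.
q_pi = {
    ("s", "left"): 1.0,
    ("s", "right"): 3.0,
}

# Deterministic policy: the state value is the action value of the action it picks.
deterministic_pi = {"s": "right"}
v_det = q_pi[("s", deterministic_pi["s"])]                                 # v_pi(s) = q_pi(s, pi(s)) = 3.0

# Stochastic policy: the state value is an expectation over the action distribution.
stochastic_pi = {"s": {"left": 0.5, "right": 0.5}}
v_stoch = sum(p * q_pi[("s", a)] for a, p in stochastic_pi["s"].items())   # 0.5*1.0 + 0.5*3.0 = 2.0

print(v_det, v_stoch)   # 3.0 2.0
```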