RL Notes (short)
Things to remember:

The discounted return is defined with respect to the future (a forward view), even though, pragmatically, it is often implemented through backward-looking eligibility traces. The forward view is valid because of the Markov Property: the current state summarizes everything needed from the past.
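For reference, this forward view is the standard definition of the return (Sutton & Barto notation):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$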

$\gamma = 0$ implies a myopic agent – one that only cares about immediate reward.

$\gamma = 1$ implies a farsighted agent – one that cares about optimizing future reward.

$0 < \gamma < 1$ implies that an agent cares about the future to varying degrees. Cartpole is an environment where we can model the reward in two different ways, and each requires a different choice of $\gamma$ (compared numerically in the sketch below).
 In one formulation the reward is +1 for every timestep the pole stays up. In this case any positive discount works: reward accumulates as the agent steps through time, so longer episodes earn larger returns.
 Alternatively, we can set the reward to be $-1$ at the failure step and 0 otherwise. In this case the agent must use $0 < \gamma < 1$. With $\gamma = 0$ the agent fails to learn anything (every immediate reward is 0 until the final timestep, so there is no signal to propagate back), and with $\gamma = 1$ every episode returns exactly $-1$, so the agent treats all episode lengths the same. With $0 < \gamma < 1$, the discounted penalty $(-1) \cdot \gamma^{T}$ shrinks toward 0 as the failure time $T$ grows, so the agent is rewarded for surviving longer.
 In general we choose a discount rate closer to 1 than to 0, like 0.9; otherwise the agent will be shortsighted.
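A minimal sketch of the comparison, assuming fixed episode lengths rather than a real Cartpole rollout (the function names and the indexing of the failure reward are my own):

```python
# Compare the two Cartpole reward schemes across discount rates.
# Assumption: the episode lasts T steps; in scheme 2 the lone -1 reward
# arrives at the final step and is discounted back to t = 0 by gamma^(T-1).

def return_plus_one_per_step(T, gamma):
    """Return when the agent receives +1 on every step it survives."""
    if gamma == 1.0:
        return float(T)
    return (1 - gamma ** T) / (1 - gamma)  # geometric series sum

def return_minus_one_at_failure(T, gamma):
    """Return when the only reward is -1 at the failure step."""
    return -(gamma ** (T - 1))

for gamma in (0.0, 0.9, 1.0):
    for T in (10, 100):
        print(f"gamma={gamma:.1f} T={T:>3}  "
              f"+1/step: {return_plus_one_per_step(T, gamma):8.3f}  "
              f"-1 at failure: {return_minus_one_at_failure(T, gamma):8.3f}")
```

With $\gamma = 1$ the penalty column is $-1$ for both episode lengths (all lengths look the same), while with $\gamma = 0.9$ the longer episode's penalty is much closer to 0.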

The simplest policy is a deterministic one. This is usually the first set of rules/heuristics you would find in a naive business solution. Bumping these up to stochastic policies ranges from as simple as a tabular MDP to as complicated as function approximation.
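A minimal sketch of the two policy types, using made-up state and action names:

```python
import random

# Deterministic policy: a lookup table mapping each state to one action.
deterministic_policy = {"low_battery": "recharge", "high_battery": "search"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low_battery": {"recharge": 0.8, "wait": 0.2},
    "high_battery": {"search": 0.9, "wait": 0.1},
}

def act(policy, state):
    rule = policy[state]
    if isinstance(rule, str):            # deterministic: state -> action
        return rule
    actions, probs = zip(*rule.items())  # stochastic: state -> P(action)
    return random.choices(actions, weights=probs)[0]

print(act(deterministic_policy, "low_battery"))  # always 'recharge'
print(act(stochastic_policy, "low_battery"))     # 'recharge' about 80% of the time
```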

Definition: $\pi' \geq \pi$ if and only if $v_{\pi'}(s) \geq v_\pi(s) \; \forall s \in S$. Following this we get that an optimal policy $\pi_*$ is one where $\pi_* \geq \pi \; \forall \pi$.
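The definition translates directly into a check over value tables (the state names and values here are invented for illustration):

```python
def policy_at_least_as_good(v_prime, v, states):
    """pi' >= pi iff v_pi'(s) >= v_pi(s) for every state s."""
    return all(v_prime[s] >= v[s] for s in states)

states = ["s0", "s1", "s2"]
v_pi = {"s0": 1.0, "s1": 2.0, "s2": 0.5}        # v under pi
v_pi_prime = {"s0": 1.5, "s1": 2.0, "s2": 0.9}  # v under pi'

print(policy_at_least_as_good(v_pi_prime, v_pi, states))  # True: pi' >= pi
```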

When you talk about the action-value function for a deterministic policy, you are asking about the value of the single route the policy takes at each state: $v_\pi(s) = q_\pi(s, \pi(s))$. This is one way to compare state-value and action-value functions directly, but it only holds in this simple form for a deterministic policy; a stochastic policy requires an expectation over actions, $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$.
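A minimal sketch of that identity with a hypothetical q-table:

```python
# For a deterministic policy, v_pi(s) = q_pi(s, pi(s)).
# For a stochastic policy, v_pi(s) is an expectation over the policy's actions.

q = {("s0", "left"): 1.0, ("s0", "right"): 3.0}

pi_det = {"s0": "right"}                    # deterministic: one action per state
v_det = q[("s0", pi_det["s0"])]             # v_pi(s0) = q_pi(s0, pi(s0)) = 3.0

pi_stoch = {"s0": {"left": 0.25, "right": 0.75}}  # stochastic: P(action | state)
v_stoch = sum(p * q[("s0", a)] for a, p in pi_stoch["s0"].items())  # 2.5

print(v_det, v_stoch)
```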