Here are my thoughts on trying to provide an example of RL. Policy gradient “Standing derivative”, weighted by the return Instead, we should do Q-learning