RL example


Here are my thoughts on trying to provide an example of RL.

  1. Policy gradient
  2. “Standing derivative”, weighted by the return
  3. Instead, we should do Q-learning