Worked on the RL project again today.
I implemented:
1. More updates to the policy network.
2. Dynamic reward baselining, using a rolling average of past rewards as the baseline (see the sketch below).
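A minimal sketch of the rolling-average baseline idea, assuming a REINFORCE-style policy-gradient update; the class names, window size, and update loop here are placeholders, not the actual project code.

```python
import collections
import torch

class RollingBaseline:
    """Dynamic reward baseline: rolling average of the last N episode returns."""
    def __init__(self, window: int = 100):
        self.returns = collections.deque(maxlen=window)

    def update(self, episode_return: float) -> None:
        self.returns.append(episode_return)

    def value(self) -> float:
        # Fall back to 0 before any returns have been observed.
        return sum(self.returns) / len(self.returns) if self.returns else 0.0


def reinforce_step(optimizer, log_probs, episode_return, baseline):
    """One REINFORCE update with the rolling-average baseline subtracted."""
    advantage = episode_return - baseline.value()      # center the return
    loss = -torch.stack(log_probs).sum() * advantage   # policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    baseline.update(episode_return)                    # refresh the baseline afterwards
```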
I am also working on the oracle-clusters paradigm: if we know the actual labels, we should be able to pick optimally. Yet somehow we can't. The policy keeps collapsing, pulled into attractors (local minima) in the RL optimization.
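A rough sketch of what the oracle check amounts to, with purely illustrative names (`clusters`, `true_labels`, `reward_fn`, `policy_pick`): compute the reward an oracle with full label knowledge would get, and compare it against what the learned policy actually achieves.

```python
import numpy as np

def oracle_gap(clusters, true_labels, reward_fn, policy_pick):
    """Gap between the oracle-optimal pick and the policy's pick.

    clusters     : list of candidate clusters to pick from
    true_labels  : ground-truth labels, visible to the oracle only
    reward_fn    : reward_fn(cluster, true_labels) -> float
    policy_pick  : index chosen by the learned policy
    """
    rewards = np.array([reward_fn(c, true_labels) for c in clusters])
    oracle_reward = rewards.max()         # best achievable with full label knowledge
    policy_reward = rewards[policy_pick]  # what the policy actually obtained
    return oracle_reward - policy_reward  # should shrink toward 0 if the policy were optimal
```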
I still want to try experience replay, as well as continuous action space prediction!
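For when I get to experience replay, a minimal replay-buffer sketch; the transition fields and capacity are generic placeholders rather than anything tied to this project.

```python
import random
import collections

Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)

class ReplayBuffer:
    """Fixed-capacity experience replay: store transitions, sample uniformly."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = collections.deque(maxlen=capacity)

    def push(self, *args) -> None:
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        # Transpose a list of Transitions into a Transition of tuples.
        return Transition(*zip(*batch))

    def __len__(self) -> int:
        return len(self.buffer)
```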
So I might try a fixed-horizon/budget policy network first.