There are a lot of RL posts and papers that discuss return normalization. This post discusses a problem with a common strategy, and my attempt to fix.
Background: Credit Assignment
With RL, we need to guess how correct each action was, at every step, given the overall performance during an episode. This is known as the “credit assignment problem.”
One common way to address credit assignment is to use discounting. If we have an array of rewards like [1,1,1,1,1,1,1], where the time moves from first element in array to last, we might update the values like so [7,6,5,4,3,2,1]. This is a way of saying, “whatever we did early was really good, because it got us to the finish line, and as we got closer to the finish line, things mattered less and less.”
There are various strategies here, depending on the environment (aka the game).
Once we’ve done credit assignment, there is usually a form of return normalization that helps networks train a bit better. This normalization decreases the range of the returns and the code is as follows:
returns = returns - mean(returns) returns = returns / stand_dev(returns)
The first line yields a new array, the first half of which are positive, and the second half of which are negative. For example, if we started with the array:
The first line would leave us with:
[4.5, 3.5, 2.5, 1.5, 0.5, -0.5, -1.5, -2.5, -3.5, -4.5]
The second line would take the first and leave us with:
[ 1.5666989 , 1.21854359, 0.87038828, 0.52223297, 0.17407766,
-0.17407766, -0.52223297, -0.87038828, -1.21854359, -1.5666989 ]
The above is what will be passed into the optimizer. But what this is saying is: encourage the actions associated with the positive values, and discourage the actions associated with negative values.
But why? What sort of credit assignment philosophy is this? Why are we discouraging the second half of the episode? We are not even taking into account how well the episode did.
It makes more sense is to look at a finished episode, and compare its performance to a running average of performance. If the episode did better than average, reward the actions in that episode. If it did worse than average, discourage the actions.
The code (in Python) would look like this, again, where the environment’s goal is to keep a pole on a cart balanced for as long as possible:
<br /># run 10 episodes without training, placing length of each episode into action_lengths array. # discount rewards of episode to get an array of returns returns = discount(rewards) # divide all returns of new episode by first (and largest) return. # All values now between 0 and 1. returns = returns / returns # compare episode length to average length. relative = len(actions) - np.average(action_lengths[-10:]) # now we can make the returns encourage or discourage # based on how the policy performed in the episode. # If episode did better than average, # the 'relative' variable will be positive. # Otherwise, it will be negative. returns = returns * relative
Exp 1: Cart-Pole Continuous Action Space
Here I used the Roboschool environment, which features the Bullet 3D physics engine.
I ran 5 trials, with both methods. The trial would end when the running average of steps reached 1000. The std_deviation normalization yielded an average of 612 episodes. The method presented in this post yielded an average of 435 episodes. Thats reaching the optimal performance in about 70% of the time of the standard method.
Exp 2: Cart-Pole Discrete Action Space
Here I used the exact code found in the list of posts below: the Tensorflow Tutorial which features the standard openAI cart-pole env in the discrete space. Here the run would be considered “done” when the average length of the episode was 200 steps.
Using their code, after five trials, the average episode length was 196. Using the strategy presented in this post, the average episode length was 113. Thats almost half as many episodes needed to reach optimal performance.