Copycat Reinforcement Learning

Copycat Reinforcement Learning

Tabula Rasa Reinforcement Learning for Real Robots….Sucks.

One of the main problems reinforcement learning faces today is the amount of training required to achieve a desired level of skill. It can take thousands, even millions of episodes to learn decently. This is because often RL starts a network from scratch (aka tabula rasa), a blank slate. When it comes to real robots, where the robot trains in real-time, millions of episodes is unacceptable. In fact, this idea and post came about as a result of reading this post, a fantastic piece by Sridhar Mahadevan about the problems with Tabula Rasa RL.

Simulation Based Reinforcement Learning for Real Robots…Sucks.

Thus, people have turned to creating models of the robots and training them in simulation, where the training can happen in speeds much faster than real time. This has worked to varying degrees of success, and is a form of transfer learning. A simulation however requires a mathematical model of the robot. If a model were readily accessible, one so realistic it’s training carries over to the real robot, it begs the question: why not just use classic control? The value of reinforcement learning comes from the fact we don’t need a model.

So then what Doesn’t Suck for Real Robots?

The question then becomes: if a model is NOT accessible, or accurate enough to translate well to a real robot, and modeless training takes millions of episodes, what are we able to do?

Often the skill the robot is trying to learn has already been learned by another “expert” system. Perhaps the expert system is another robot that’s learned to balance (real or virtual), or another robot that’s learned to walk, or if the skill is bipedal walking, any healthy adult would quality as the expert system.

Is it possible to utilize these expert systems when training our own robots, even without a model? Can a robot “watch” and use what it sees to help it learn, without involving any sort of models, yet drastically cut down on the training sessions needed?

Yes.

The following algorithm falls into the Imitation Learning family of algorithms. Furthermore, it falls under the Behavioral Cloning division of imitation learning. However, it deviates slightly from standard behavior cloning, since (as you will read below) we do not have access to the expert’s action commands. We only have access to their states, as if we were watching footage of them, with no insight into what commands of the expert’s neural network outputs.

Assumptions in Plain English

To get our imaginations on the same page, lets say the system to train is a pole balancing on a cart which moves along a track (the classic cart-pole control problem).

The teacher system might be a human moving the cart back and forth to keep the pole balanced, or an existing algorithm. Lets also assume the teacher system is not even the same physical system as the system we wish to train. It might be slightly larger in scale, a slightly different length pole, different motor, etc.. We have access to recordings of the teacher system’s state changes, not its actions. So from the recorded session we can see the pole’s change in angle, the change in the cart’s x-position. But the recordings don’t show the force exerted on the cart. In other words we have the kinematics but not the dynamics. In the diagram above, the recordings contain X and $\theta$ but not force F. Similar to watching someone perform an acrobatic feat we wish to learn: we can see how their body moves, but we don’t have access to what their brains told their muscles to do.

Finally lets say that for the cartpole, in a recording each step (“frame” in the “footage”) holds the following information:

1. angle ($\theta$)
2. x-position (x)
3. rate of change of angle ($\dot{\theta}$)
4. rate of change of x-position ($\dot{x}$)

Algorithm in Plain English

1. Get recordings from the teacher (human, robot, simulation, etc…) system, as mentioned above.
2. Record student system trying the thing.
3. For each step (frame in the recorded footage) of student system:
A. Look at the state ($\theta$, $\dot{\theta}$, x, $\dot{x}$ )
B. In the recordings of the teacher system, look through each step and find the closest matching state.
C. Now look at the NEXT frame in the student system, and the NEXT frame in the teacher system, and compare THOSE states.
D. If the student system did something similar to what the teacher system did in THOSE NEXT states, the student system did well – reinforce that behavior. If the student system did something differently, it did not do well – discourage that behavior.

The main thing we want to know is: how well did the student do at any given step. In school we might call this the student’s “grade”. Here is how we define the student’s grade ($G$) at any given step:

Given the state of the student step at time t:

• Let $S(t)$ be the student’s state at time $t$.
• Let $T(S(t))$ be the closest matching teacher state of $S(t)$.
• We shorten this to $T$ for readability.
• Let $T_{next}$ be the state in the step following $T$.
• Let $D(t)$ be the Euclidean distance between $S(t)$ and $T$ in N-dimensional space.
• Let $D(t+1)$ be the Euclidean distance between $S(t+1)$ and $T_{next}$
• Let $G(t)$ be the grade of the student at time $t$, which can range between $[-1,1]$. This range is good because its zero centered, which we want when training networks.

$G(t) = 0.5^{D(t)} * tanh(2\frac{D(t)}{D(t+1)} - 1.5)$

That looks like a lot so lets break it down:

• The first term is $0.5^{D(t)}$. Remembering that the range of $D(t)$ is $[0, \infty]$, this first term yields a range between $[1,0]$.
• It looks like this:
• It’s effect in plain english is: “if we found a good match between student and teacher for this state, make this grade count for a lot!”.
• The second term is $tanh(2*\frac{D(t)}{D(t+1)} - 1.5)$. We can talk more about $\frac{D(t)}{D(t+1)}$ next but for now lets see a graph of $tanh(2x-1.5)$:

• *Notice that for values greater than 0.75, $y$ is positive, and approaches 1. For values less than 0.75, $y$ is negative and approaches -1.
• Why 0.75? No strong reason. It felt acceptable for the student to be “close” to the teacher at $D(t+1)$ relative to $D(t)$, even if the student did not exactly match the teacher. 0.75 felt reasonable. Feel free to change.
• Finally, the ratio $\frac{D(t)}{D(t+1)}$ is simply looking at how the student did at time $t$ vs time $t+1$. In other words, did the student maintain a relative distance between the two timesteps?

Refresher: Vanilla Reinforce Algorithm

• Input: differentiable policy $\pi(a|s,\theta)$,
• Parameters: step sizes $\alpha > 0$
• Initialize parameters $\theta$
• Repeat Forever:
• Generate an episode of length T
• For each step of episode t = 0,...,T-1:
• $G$ = return from step t
• $\theta \leftarrow \theta + \alpha G\nabla_{\theta}\pi(a|s,\theta)$

Copycat is basically the above, except G, the return is defined by how well the student copies the states of the teacher.

Copycat Algorithm in Mathy Pseudocode

Now we have the abstract idea, lets see it in mathy pseudocode. In sections below I offer a refresher in Vanilla Reinforce, and talk about the type of network used for training:

• Input: differentiable policy $\pi(a|s,\theta)$,
• Parameters: step sizes $\alpha > 0$
• Initialize parameters $\theta$
• Given recordings of Teacher episodes
• Repeat Forever:
• Generate a student episode of length
• For each step of episode:
• Given $S(t)$, which is the state of student step $t$, find the closest teacher state $T$
• Determine $D(t)$, the Euclidean distance between S(t) and T.
• Determine $D(t+1)$, the Euclidean distance between $S(t+1)$ and $T_{next}$
• Determine $G(t) = 0.5^{D(t)} * tanh(2\frac{D(t)}{D(t+1)} - 1.5)$
• Finally, update the parameters:$\theta \leftarrow \theta + \alpha G\nabla_{\theta}\pi(a|s,\theta)$

Algorithm in Cody-Pseudocode

For the coders out there who don’t like maaaath:

// some min dist between matching
// student/teacher states
MIN_DIST =

// a state at time n (sn) is an array:
sn = [x_n, xdot_n, theta_n, thetadot_n]

// given a recording of a teacher episode,
// which is an array of states:
teacher_episode = [s0, s1,...,sn]

// and an episode from student
student_episode = [s0, s1,...,sn]

// remember the length of episode of teacher vs student does not matter.

// this is a naive search
for student_index in student_episode
// ss is current student state
ss = student_episode[student_index]

// for min_dist pick some number that will be
// larger than all real distances
min_dist = 1000

best_match = null
for teacher_index in teacher_episode
// ts is current teacher state
ts = teacher_episdoe[teacher_index]

dist = distance(ss,ts)
if dist < min_dist min_dist = dist best_match = ts best_index = teacher_index // is our min_dist small enough? if not, we don't // have a close enough teacher state, so skip this // state from student's episode. next if min_dist > MIN_DIST

// now we have the best match, lets see
// how the student's NEXT state compares
// to the teachers NEXT state.
next_ts = teacher_episode[best_index + 1]
next_ss = student_episode[student_index + 1]
next_dist = distance(next_ss, next_ts)
ratio_dist = min_dist / next_dist

// the smaller the next_dist (relative to min_dist),
grade = 0.5^min_dist * tanh(2* ratio_dist - 1.5)

// now do something with the grade,
// like train the network. this will amount
// to reinforcing the **action** the student
// took when in state ss, if grade is good,
// and discouraging **action** student took


This Algorithm is Not…

1. Not Transfer Learning, because in TL, the same network is used when trained on problem A, to solve problem B.
2. Not Supervised Learning, because in SL, the training set’s output would (in the case of robotics) be an action. However this algorithm does not utilize inputs and expected outputs. It uses adjacent state pairs.

Algorithm Main Wins

1. helps the robot escape local minima, and reach the (hopefully) global minima seen in the recordings of the teacher system.
2. avoids tabula rasa, getting the robot moving towards the global minima right away, saving a lot of time
3. frames are judged on a per-frame basis. Either your action got you relatively close to the expected next state, or it didn’t. No more credit assignment problem.

Grouping Similar Recorded State Pairs

If we’ve recorded a few trained teacher system episodes (or thousands), there might be grouping of similar frame-pairings. It would be nice to expose these groups and merge them into an average frame-pairings that has a stronger “weight.” This reduces search space, and also solves a problem with noise, as described below.

Potential Problems

The main problem i see with the idea so far is that it will enforce both the good pairs as well as the pairs that represent perturbations (moving from a good state to a bad state). How is the training robot to know which pairs were good recovery pairs, versus a gust of wind nocking the teacher over? I suspect that given the grouping of pairs mentioned in the previous section, the perturbations should be drowned out by the average pairs.

Mixing Vanilla Reinforce and CopyCat

We now have two totally different, independent ways to perform reinforcement learning. They both have pros and cons. But we don’t need to just choose one! Here is an example of mixing the two:

• Input: differentiable policy $\pi(a|s,\theta)$,
• Parameters: step sizes $\alpha > 0$
• Initialize parameters $\theta$
• Repeat Forever:
• Generate an episode of length T
• For each step of episode t = 0,...,T-1:
• $G_{cc} = 0.5^{D(t)} * tanh(2\frac{D(t)}{D(t+1)} - 1.5)$
• $G_v$ = vanilla reinforce return from step t
• $G = G_v*K_v + G_{cc} * K_{cc}$
• where $K_{cc}$ and $K_v$ are constants defining how much we care about each of the two methods.
• $\theta \leftarrow \theta + \alpha G\nabla_{\theta}\pi(a|s,\theta)$