# More A2C in Tensorflow

*I started writing this post a long time ago, but never got to the point of finishing and publishing it. Since A3C and A2C still seem to be relevant I’ve decided to wrap it up and publish it.*

# Introduction

Ever since Deepmind’s publication on playing Atari games with Deep Reinforcement Learning I’ve been playing around with reinforcement learning. I used to work with it when I was studying, but that was in the 90s. A lot has changed since then.

A2C or *Advantage Actor Critic* is a popular reinforcement learning algorithm. Although you can download a good implementation from OpenAI’s Baselines, it is way more fun to implement it yourself.

I experienced that, apart from reading the paper, reading the experiences and code of other developers really helps understanding the algorithm. In this post I try to put down my explanation of the algorithm. If other blogs and papers do not help, then perhaps my will.

Before I start, I do want to mention some papers and websites which really helped me:

- The A2C paper
- A paper on Generalized Advantage Estimation (GAE)
- The source code from OpenAI baselines
- A post on A3C in Tensorflow from Arthur Juliani

But you should really start with this post of Rudi Gilman on A2C. It’s an excellent introduction into Actor-Critic algorithms.

I use Google’s Tensorflow, OpenAI’s Gym and Numpy as main Python libraries to implement A2C. I assume that the reader has some basic knowledge of Tensorflow, Gym and Numpy. I also assume that you have basic neural network knowledge experience. Make sure that you understand the MNIST example with Tensorflow.

# The model

The model we are training is a policy which takes images from an Atari game as input and gives a value for each possible action as output. The output with the highest value represents the action which is proposed by the policy as the best action.

with tf.variable_scope(scope): x = tf.placeholder(tf.uint8, shape=[None, input_res, input_res, 1], name='x') x_conv = tf.cast(x, tf.float32) / 255.0 conv1 = tf.layers.conv2d(inputs=x_conv, filters=32, kernel_size=[8,8], strides=(4, 4), activation=tf.nn.relu, kernel_initializer=xavier_initializer_conv2d()) conv2 = tf.layers.conv2d(inputs=conv1, filters=64, kernel_size=[4,4], strides=(2, 2), activation=tf.nn.relu, kernel_initializer=xavier_initializer_conv2d()) conv3 = tf.layers.conv2d(inputs=conv2, filters=64, kernel_size=[3,3], strides=(1, 1), activation=tf.nn.relu, kernel_initializer=xavier_initializer_conv2d()) conv_out = tf.contrib.layers.flatten(conv3) shared_dense = tf.layers.dense(conv_out, 512, activation=tf.nn.relu, kernel_initializer=xavier_initializer()) policy_logits = tf.layers.dense(shared_dense, number_of_actions, kernel_initializer=xavier_initializer())

The model is defined within a scope. This is needed for training purposes as we will see later. The input ** x** is a vector of square, 8-bit, grayscale images. These are converted to floating point values between 0 and 1. Following are 3 convolution layers and two dense layers. The convolutional layers will be trained to perform feature abstraction and the dense layers combine these features for classification (i.e. classify which action is the best). And that is it.

There are a few other addition to the model which need to be added for training purposes (within the same scope).

policy = tf.nn.softmax(policy_logits) value_function = tf.layers.dense(self.shared_dense, 1, kernel_initializer=xavier_initializer())

The ** softmax** function allows us to interpret the policy output as a probability distribution, which we need for the loss function. For training we also to train a second function, called the value function. This function has a single output, which estimates the future reward one can gain from a certain state. It is possible to train two different functions. However both of them are expected to require the same feature detection layers, which is why both functions share those layers.

Another name for the value function is the critic and another name for the policy is the actor. Hence we get the name *Actor Critic Algorithm*. Where the word *Advantage* comes from, will become clear in the next paragraphs.

# Training

## Loss function

The code below shows the definition of the loss function and the inputs for that loss function.

with tf.name_scope("{}_loss".format(scope)): reward = tf.placeholder(tf.float32, shape=[batch_size]) action = tf.placeholder(tf.int32, shape=[batch_size]) advantage = tf.placeholder(tf.float32, shape=[batch_size]) value_loss = 0.5 * tf.reduce_mean(tf.square(tf.squeeze(value_function) - reward)) action_one_hot = tf.one_hot(action, number_of_actions, dtype=tf.float32) neg_log_policy = -tf.log(tf.clip_by_value(policy, 1e-7, 1)) policy_loss = tf.reduce_mean(tf.reduce_sum(neg_log_policy * action_one_hot, axis=1) * advantage) self.entropy = tf.reduce_mean(tf.reduce_sum(policy * neg_log_policy, axis=1)) self.total_loss = 0.5 * value_loss + policy_loss - 0.01 * entropy

For both the policy and the value function separate loss functions are defined. The loss function for the value function is nothing more than the squared difference between the estimated value and the actual reward given a certain state. This loss function is also called the *L2 loss*. Instead of writing it like this, we could have used **tf.nn.l2_loss** as well. But I have made this example to gain more understanding of the algorithm and not to write the shortest and most efficient version. I therefore prefer to write out the complete function. The **tf.squeeze** is just a function, which make the shape of the value function equal to the reward tensor, just that we can subtract them.

The loss function for the policy is a bit more complicated. The action placeholder encodes the actions taken from a certain state as integer nummers. The policy however has an output for each action and gives the likelihood that a specific action will result in the highest score. This way of encoding is called *one hot encoding*. So the first step is to convert the action to a one hot encoded action.

To compute the policy loss (and later also the entropy) we need the -log(policy). The output of the policy is nicely clipped between a small non-zero number and 1 to prevent taking the log of zero (i.e. prevent creating a NaN).

The most important part of A3C and A2C loss functions is the next part where the loss is computed by multiplying the negative log of the policy output with the so called *Advantage*. There are many ways to compute the advantage (see the GAE paper). It basically boils down to determining if the actual results of an action was better or worse than the estimated the result. If the result was better (i.e. a positive advantage), we want to slightly increase the probability that the action is chosen. If the result was worse (a negative advantage), we want to lower the probability the action is chosen. The next paragraph shows what this concretely means.

Only the output related to the action needs to be adjusted, since we only know the result of that action. By multiplying the negative log of the policy output with the one hot encoded vectors, we force all outputs other than the one of the action to zero.

An optimizer needs a single loss function as input. The value and policy loss functions are added to create a single loss function. Training works best if the value loss has less influence than the policy loss. The value loss is therefore multiplied with 0.5.

Optimizing with this loss function could result in converging too quickly to a sub optimal solution. I.e. the probability of a single action is significant higher than any other, causing it to always be chosen. To prevent this we add a penalty on having a high entropy. The entropy here is the Shannon Entropy, which is the sum of each output multiplied by its negative log: *entropy = -sum(P * log(P))*. The entropy is multiplied with 0.01 to prevent the penalty being too high and preventing any convergence at all.

Note that *cross_entropy(Q,P) = -sum(Q * log(P))* for distributions *Q* and *P*. So the *entropy(P) = cross_entropy (P, P)*. So you could for example use **tf.nn.spare_softmax_cross_entropy_with_logits** as well, if you use **policy_logits** instead of **policy**.

One note relating the use of reduce_mean. Typically in textbooks the L2 norm, entropy, etc. are sums. This does mean however that the values change if the size of the input tensors changes. This size is related to the training batch size. Having different batch sizes influences the training performance. Taking the mean makes the results comparable.

## Testing and creating training data

For testing OpenAI’s Gym is used. If you do not have any experience with Gym, then first read the tutorial on running the Cartpole example.

s = start_state done = False states = [] rewards = [] actions = [] total_reward = start_reward for _ in range(5): a = pi(s, sess) s_next, r, done, _ = env.step(a) states.append(s) rewards.append(r) actions.append(a) total_reward += r s = s_next if done: s = s * 0 break

For 5 steps an action is selected given the current state using the function **pi(s, sess)**, where **s** is the current state and **sess** is the Tensorflow session. Then we take a single step in the simulation environment **env**. The step function returns the state **s_next** we’re in after executing the action **a**. The reward **r** gained during the step is returned. Being **done** means either that we missed the ball and lost or we finished the level.

We store each state **s**, the action **a** resulting given that state and the resulting reward **r**. A typical mistake here is storing the state **s_next** together with the action and result. Then you’re training the model to estimate which action it should have taken instead of training which action to take.

Another essential part of the algorithm is that we limit the amount of steps to 5** **instead of always running until **done** is **true**. The state **s** is stored at a later state and used again as the start state the next time we run this piece of code. This way a game is split in sections of 5 steps. This will help stabilize training in a later stage.

Note that I have omitted some functions in the code, which transform the RGB image from the environment to the greyscale image needed for the input. Actually more tricks are usefully, like adding up several subsequent frames, but this does not help understanding the algorithm.

The function **pi(s, sess)** runs the policy given the state **s**. This results in a probability distribution for all actions. Using this distribution we select a random action using the choice method from Numpy. This is another reason why it is practical that we applied softmax to the policy logits, such that the policy output can be used as a distribution.

def pi(s, sess): actions = sess.run(policy, feed_dict={x : [s]}) return np.random.choice(len(actions[0]), p=actions[0])

By not selecting the best action, but instead a random action we enforce exploration. Once we are done training we always take the action with the highest probability.

Once we finish the above loop the next piece we start constructing the training data.

R = 0 if not done: R = vf([s], sess) n = len(states) VF = vf(states, sess) RW = np.zeros(n) ADV = np.zeros(n) A = np.array(actions) for i in range(n - 1, -1, -1): R = rewards[i] + 0.99 * R RW[i] = R ADV[i] = R - VF[i]

Where the function **vf(states, sess)** is defined as:

def vf(states, sess): v = sess.run(value_function, feed_dict={x: states}) return v

**VF** is the vector with all the expected rewards per state, also called the *baseline*. **RW** is the vector with discounted rewards. This reward is computed from the reward obtained by executing the chosen action in this state + all the rewards we could get if we keep following this path of actions. This allows the algorithm to value actions which do not immediately result in a reward, but have a high reward in the future. However, we do not want the reward to be too far in the future. The total future reward is therefore discounted by factor 0.99. A reward 100 steps away has little influence this way.

This is the point that we compute the advantages and store them in vector **ADV**. As you can see the advantage is the difference between the expected total future reward and the actual future reward.

There is one problem though: We only know that for sure that when we’re done, that there is no future reward. If we finished the loop because we iterated for 5 times, then we do not know the total future reward we could get. In that case we use the value function to estimate the total future reward. This seems counter intuitive at first. We compute the advantage using an estimated value and a discounted reward which at its basis also has an estimated value. However, every time we do get to the end state we improve the value function a little. And when we improve the value function we improve the advantages. And improved advantages will result in an improved policy, which again results in reaching the end state with a higher reward, which improves the value function again, etc, etc.

At this point we have the training data for at most 5 steps in a single training environment. We repeat these steps for multiple environments, each with a different seed for the random generator, resulting in different games. In my application I use 64 different environments. The output of these are grouped in batches of 800 values per training data type. These batches are then fed to the training function.

## Training the model

Training is done by first collecting all the trainable variables of the model in local_vars. The gradients are computed using the loss function. Before applying these gradients, we clip them to prevent to big steps, causing us to get stuck in a local minimum.

with tf.variable_scope(scope): local_vars = tf.trainable_variables() with tf.name_scope("{}_gradient".format(scope)): lr = tf.placeholder(tf.float32) trainer = tf.train.RMSPropOptimizer(learning_rate=self.lr, decay=0.99, epsilon=1e-5) clip_norm = tf.placeholder(tf.float32) gradients = tf.gradients(total_loss, local_vars) global_norm = tf.global_norm(gradients) gradients, _ = tf.clip_by_global_norm(gradients, clip_norm) apply_gradients = trainer.apply_gradients(zip(gradients, local_vars))

The learning-rate LR is an input, such that I can slowly let it decrease. I let it start at 7e-4 and let it drop in 40 million steps to 1e-7.

# The results

One important part I am personally missing in many articles are details about the results. How do I know my implementation is working? And if it is not like the published results, is that normal? To help you, I have written down my results here. As you can see, the results are not as nice as the results published by OpenAI or DeepMind. A2C and A3C are very sensitive to its hyperparameters. Quite often the training just results in garbage.

I use Tensorboard to plot different statistics, like the high-score, the average score, but also the output of the loss functions, entropy, etc. One thing I always do, is record the input. Just to make sure that there are no bugs in converting the input. Here I display the first 6 frames used for input (not all 800 of course).

When we look at the performance I’m mostly interested in two plots: The high-score and the mean reward. The high-score gives me some insight in how far exploration got so far. The mean reward tells me how well the model can handle all situations.

After 11 million steps there is a very large jump in the high-score. This is when the model figures out how to get the ball behind the blocks where it keeps bouncing. This graph actually doesn’t say if that high score was a one-time lucky shot. I therefore prefer to also plot the distribution:

Here you can clearly see that it is not a one-time hit, but that it happens more often. You can also see that the model doesn’t gradually get better, but that the frequency of high scores goes up.

The mean is still a little more disappointing. This is partly of course because of exploration. The algorithm keeps trying new actions. Some of those will fail.

Plotting the entropy provides some valuable information about how well the model is performing. First of all, the model has 4 outputs (Breakout has either 4 or 6 actions. In my case 4). If each action is equally probable, then the entropy is – 4 * 1/4 * log(1/4) which is roughly 1.39. As long as the logged entropy is 1.39, the model does not perform better than just randomly picking an action. If the logged entropy is higher than 1.39, there is a bug in the program. It is good to just compute a few cases, to get a feeling for the probabilities and the related entropy. For example when one action’s probability is twice as high as the rest the entropy is – (3 * 0.2 * log(0.2) + 0.4 * log(0.4)) which is 1.33. If one action is 9 times more probable than the rest the entropy is -(3 * 0.1 * log(0.1) + 0.9 * log(0.9)), which is 0.79, etc, etc, etc.

In the plot we can clearly see the entropy start just below 1.39 and then it start to drop after a while. With a significant drop 10 million steps, which is when we start to get to very high scoring runs.

Below are the results of a few other runs, just to show the effect of small changes in learning rate. The first image below starts with a learning rate of 10e-3 instead of the 7e-4 of the results above. The learning rate is reduced to 1e-7 in 40 million steps. The jump in high score happens a lot earlier.

But the mean reward shows a complete different picture. The start is very good, but then learning becomes unstable and we end up at the same point.

Decreasing the learning rate in 30 million steps improves learning significantly:

I also played around with the number of steps I run the policy before interrupting. When I change it from 5 to 10, increased the start value of the learning rate to 15e-3 and decrease the learning rate in 40 million steps again, I get quite a nice improvement:

A very interesting observation when looking at the histogram, is that the results is either really poor or quite good, but nothing in between. I’m wonder why that is…

*That is it. Hopefully this article helped a little on your way. It is not as extensive as I initially planned. But writing articles is not my hobby, coding is. Perhaps if I’m up to it I elaborate more on certain subjects.*