Q-values exploding when training DQN - python

I'm training a DQN to play one of OpenAI Gym's Atari environments, but the Q-values of my network quickly explode far above anything realistic.
Here's the relevant portion of the code:
for state, action, reward, next_state, done in minibatch:
    if not done:
        # To save on memory, next_state is just one frame,
        # so we have to add it to the current state to get the actual input for the network
        next_4_states = np.array(state)
        next_4_states = np.roll(next_4_states, 1, axis=3)
        next_4_states[:, :, :, 0] = next_state
        target = reward + self.gamma * \
            np.amax(self.target_model.predict(next_4_states))
    else:
        target = reward
    target_f = self.target_model.predict(state)
    target_f[0][action] = target
    self.target_model.fit(state, target_f, epochs=1, verbose=0)
The discount factor is 0.99 (the explosion doesn't happen with a discount factor of 0.9, but then the agent also doesn't converge because it can't look far enough ahead).
Stepping through the code, the reason it's happening is that all the Q-values that aren't meant to be updated (the ones for actions we didn't take) increase slightly. It's my understanding that passing the network's own output back to the network during training should keep the output the same, not increase or decrease it. Is there something wrong with my model? Is there some way I can mask the update so it only updates the relevant Q-value?
EDIT: My model creation code is here:
def create_model(self, input_shape, num_actions, learning_rate):
    model = Sequential()
    model.add(Convolution2D(32, 8, strides=(4, 4),
                            activation='relu', input_shape=input_shape))
    model.add(Convolution2D(64, 4, strides=(2, 2), activation='relu'))
    model.add(Convolution2D(64, 3, strides=(1, 1), activation='relu'))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(num_actions))
    model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
    return model
I create two of these. One for the online network and one for the target.

Which predictions get updated?
Stepping through the code, the reason it's happening is that all the Q-values that aren't meant to be updated (the ones for actions we didn't take) increase slightly. It's my understanding that passing the network's own output back to the network during training should keep the output the same, not increase or decrease it.
Below I have drawn a very simple neural network with 3 input nodes, 3 hidden nodes, and 3 output nodes. Suppose that you have only set a new target for the first action, and simply use the existing predictions as targets again for the other actions. This results in a non-zero error (denoted by delta in the image; for simplicity I'll just assume it is greater than zero) only for the first action/output, and errors of 0 for the others.
I have drawn the connections through which this error will be propagated from output to hidden layer in bold. Note how each of the nodes in the hidden layer still gets an error. When these nodes then propagate their errors back to the input layer, they'll do this through all of the connections between input and hidden layer, so all of those weights can be modified.
So, imagine all those weights got updated, and now imagine doing a new forward pass with the original inputs. Do you expect output nodes 2 and 3 to have exactly the same outputs as before? Probably not; the connections from hidden nodes to the last two outputs may still have the same weights, but all three hidden nodes will have different activation levels. So no, the other outputs are not guaranteed to remain the same.
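To make this concrete, here is a small self-contained sketch (my own illustration, not from the original post) that fits a tiny Keras network on targets where only the first output is changed, and then shows that the other outputs shift as well:
import numpy as np
from tensorflow import keras

# Tiny 3-input, 3-output network, mirroring the drawing above (tanh hidden layer so gradients always flow).
model = keras.Sequential([
    keras.layers.Dense(3, activation='tanh', input_shape=(3,)),
    keras.layers.Dense(3)
])
model.compile(loss='mse', optimizer=keras.optimizers.SGD(learning_rate=0.1))

x = np.array([[0.2, 0.5, 0.9]])
before = model.predict(x)

# Use the network's own predictions as targets, but change only output 0.
target = before.copy()
target[0][0] += 1.0
model.fit(x, target, epochs=1, verbose=0)

after = model.predict(x)
print(before)
print(after)  # outputs 1 and 2 typically move as well, because the shared hidden weights changed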
Is there some way I can mask the update so it only updates the relevant Q value?
Not easily, no, if at all. The problem is that the connections between pairs of layers other than the final pair are not action-specific, and I don't think you would want them to be either.
Target Network
Is there something wrong with my model?
One thing I'm seeing is that you seem to be updating the same network that is used to generate targets:
target_f = self.target_model.predict(state)
and
self.target_model.fit(state, target_f, epochs=1, verbose=0)
both use self.target_model. You should use separate copies of the network for those two lines, and only after longer periods of time copy the updated network's weights into the network used to compute targets. For a bit more on this, see Addition 3 in this post.
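As a rough sketch of that structure, adapted from the loop in the question (the attribute names self.model for the online network and self.steps / self.update_target_every for the sync schedule are my own, hypothetical additions):
# The online network (self.model) is the one that gets trained; the target network
# (self.target_model) only provides the bootstrap targets and is synced periodically.
for state, action, reward, next_state, done in minibatch:
    if not done:
        next_4_states = np.array(state)
        next_4_states = np.roll(next_4_states, 1, axis=3)
        next_4_states[:, :, :, 0] = next_state
        target = reward + self.gamma * \
            np.amax(self.target_model.predict(next_4_states))
    else:
        target = reward
    target_f = self.model.predict(state)                   # predict with the online network
    target_f[0][action] = target
    self.model.fit(state, target_f, epochs=1, verbose=0)   # train only the online network

self.steps += 1                                            # hypothetical step counter
if self.steps % self.update_target_every == 0:             # hypothetical sync interval
    self.target_model.set_weights(self.model.get_weights())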
Double DQN
Apart from that, it is well known that DQN can still have a tendency to overestimate Q values (though it generally shouldn't completely explode). This can be addressed by using Double DQN (note: this is an improvement that was added later on top of DQN).
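For illustration, the Double DQN target computation could look roughly like this (my own sketch, reusing the names from the question): the online network selects the greedy next action, and the target network evaluates it.
# Double DQN: the online network picks the best next action,
# the target network supplies its value estimate for that action.
if not done:
    next_q_online = self.model.predict(next_4_states)
    best_action = np.argmax(next_q_online[0])
    next_q_target = self.target_model.predict(next_4_states)
    target = reward + self.gamma * next_q_target[0][best_action]
else:
    target = reward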

I encountered a similar problem with Double DQN. Lowering the learning rate resolved it.

Related

Neural network with linear activation output. Calculate output range for each of the output neurons

Let's assume I have a neural network like the following:
model = keras.models.Sequential()
model.add(keras.layers.Dense(10, input_shape=(5,), activation='relu'))
model.add(keras.layers.Dense(4, activation='linear'))
That is, a network with n output neurons, each with a linear activation function.
The training process is not important here, so we can take a look at the random weights that keras initialized using:
model.weights
Of course, in a real example, these weights should be adjusted in the training process.
Depending on these model.weights, each of the output neurons returns values in a range.
I would like to calculate this exact range.
Does keras offer any function to calculate it?
I built a flawed piece of code to make an approximation of it, using a loop and predicting random inputs. But this would not be really useful in a real example with much more inputs/neurons/weights.
Here are a few examples to try to clarify my question (all of them assume that the input values are between 0 and 1):
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,),
                             activation='linear', use_bias=False))
model.set_weights([np.array([1, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between 0 and 2
model.set_weights([np.array([-0.5, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between -0.5 and 1
model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(2,), activation='linear', use_bias=False))
model.add(keras.layers.Dense(1, activation='linear', use_bias=False))
model.set_weights([np.array([1, 1, 1, 1]).reshape(2,2), np.array([1, 1]).reshape(2,1)])
For the previous example, the output neuron results would be between 0 and 4
These are simplified examples. In a real scenario with a much more complex network structure, activation functions, biases, etc., these ranges are not obvious to calculate.
It sounds like you are roughly interested in what is referred to as neural network verification. This field broadly consists of answering the question: given a range of possible inputs, what is the range of possible outputs from a neural network with a given set of weights? A few things to note:
A neural network is essentially a complex, non-linear function. That is, it maps the input space to the output space. Defining an output range does not make sense except with respect to an input range. In your question you make no reference to the inputs, so your examples are flawed/incomplete.
In general, neural network verification is an emerging field with most published works being fairly recent (last 5-7 years). That being said, there are exact and approximate methods for fully connected networks with a variety of activation functions. I'll list a few such methods here:
https://arxiv.org/abs/2004.05519 - MATLAB toolbox, but you could export your neural network in ONNX format and then use MATLAB for the verification/output range analysis.
https://arxiv.org/abs/1804.10829 - specifically for ReLU activation function.
https://anwu1219.github.io/download/Marabou.pdf with python API available here: https://github.com/NeuralNetworkVerification/Marabou
The field is still evolving, so you may have to do some of the coding yourself rather than using pre-existing libraries in some cases, but these papers, or a search query for "neural network verification", should at least give you some ideas of where to start.
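As a very rough illustration of the simplest (and loosest) of these ideas, here is a small interval-arithmetic sketch for fully connected layers; it propagates per-neuron lower/upper bounds through the weights and yields a (generally over-approximated) output range. This is my own sketch, not code from any of the papers above:
import numpy as np

def interval_bounds(weights, biases, lower, upper, relu_mask):
    """Propagate elementwise input bounds [lower, upper] through dense layers.

    weights: list of (in_dim, out_dim) arrays, biases: list of (out_dim,) arrays,
    relu_mask: list of bools saying whether each layer is followed by a ReLU.
    """
    lo, up = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    for W, b, relu in zip(weights, biases, relu_mask):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        new_lo = lo @ W_pos + up @ W_neg + b
        new_up = up @ W_pos + lo @ W_neg + b
        if relu:
            new_lo, new_up = np.maximum(new_lo, 0.0), np.maximum(new_up, 0.0)
        lo, up = new_lo, new_up
    return lo, up

# Second example from the question: two linear layers, all weights 1, no bias, inputs in [0, 1].
W1 = np.ones((2, 2)); W2 = np.ones((2, 1))
lo, up = interval_bounds([W1, W2], [np.zeros(2), np.zeros(1)],
                         [0, 0], [1, 1], [False, False])
print(lo, up)  # -> [0.] [4.], matching the stated 0-to-4 range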
IMO, as far as I know there is no such function to estimate the output value's range (without imposing your input restriction).
For example, a Dense layer without bias is just a plain linear function a = bx; in your case, you are restricting x to the 0-1 range and explicitly setting b to your desired values.
You will always get values in the ranges you've cited in your question only because of those restrictions. As a hypothetical example, if you chose b randomly, the ranges in your question would no longer hold.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,), activation='linear', use_bias=False))

# model.set_weights([np.array([1, 1]).reshape(2, 1)])

# Evaluate the (randomly initialized) layer on an input and plot a histogram of the outputs
eval_func = keras.backend.function([model.input], model.layers[-1].output)
outputs = eval_func(np.array([[2, 1]]))
counts, bins = np.histogram(outputs)
plt.hist(bins[:-1], bins, weights=counts)

Experiment shows that LSTM does worse than Random Forest... Why?

LSTM is supposed to be the right tool to capture path-dependency in time-series data.
I decided to run a simple experiment (simulation) to assess the extent to which LSTM is better able to understand path-dependency.
The setting is very simple. I just simulate a bunch (N=100) of paths coming from 4 different data-generating processes. Two of these processes represent a real increase and a real decrease, while the other two represent fake trends that eventually revert to zero.
The following plot shows the simulated paths for each category:
The candidate machine learning algorithm will be given the first 8 values of the path ( t in [1,8] ) and will be trained to predict the subsequent movement over the last 2 steps.
In other words:
the feature vector is X = (p1, p2, p3, p4, p5, p6, p7, p8)
the target is y = p10 - p8
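Concretely, assuming paths is an (N, 10) array of simulated paths (my notation, not from the post), the features and targets are built roughly like this:
import numpy as np

# paths: (N, 10) array of simulated paths (hypothetical placeholder for the simulation output).
rng = np.random.default_rng(0)
paths = np.cumsum(rng.normal(size=(100, 10)), axis=1) / 100.0

X = paths[:, :8]               # first 8 observations, t in [1, 8]
y = paths[:, 9] - paths[:, 7]  # movement over the last 2 steps: p10 - p8

# The LSTM additionally needs a trailing "one feature per timestep" axis:
X_lstm = X.reshape(-1, 8, 1)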
I compared LSTM with a simple Random Forest model with 20 estimators. Here are the definitions and the training of the two models, using Keras and scikit-learn:
# LSTM
model = Sequential()
model.add(LSTM((1), batch_input_shape=(None, H, 1), return_sequences=True))
model.add(LSTM((1), return_sequences=False))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_X_LS, train_y_LS, epochs=100, validation_data=(vali_X_LS, vali_y_LS), verbose=0)
# Random Forest
RF = RandomForestRegressor(random_state=0, n_estimators=20)
RF.fit(train_X_RF, train_y_RF);
The out-of-sample results are summarized by the following scatter plots:
As you can see, the Random Forest model is clearly outperforming the LSTM. The latter seems to be not able to distinguish between the real and the fake trends.
Do you have any idea to explain why this is happening?
How would you modify the LSTM model to make it better at this problem?
Some remarks:
The data points are divided by 100 to make sure gradients do not explode
I tried to increase the sample size, but I noticed no differences
I tried to increase the number of epochs over which the LSTM is trained, but I noticed no differences (the loss becomes stagnant after a bunch of epochs)
You can find the code I used to run the experiment here
Update:
Thanks to SaTa's reply, I changed the model and obtained much better results:
# Updated LSTM Model
model = Sequential()
model.add(LSTM((8), batch_input_shape=(None, H, 1), return_sequences=False))
model.add(Dense(4))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
Still, the Random Forest model does better. The point is that RF seems to understand that, conditional on the class, a higher p8 predicts a lower outcome p10 - p8 and vice versa, because of the way the noise is added. The LSTM seems to fail at that: it predicts the class rather well, but we see a within-class downward-sloping pattern in the final scatter plot.
Any suggestions to improve on that?
I wouldn't expect LSTM to win every battle against traditional methods, but I do expect it to perform well on the problem you have posed. Here are a couple of things you can try:
1) Increase the number of hidden units in the first layer.
model.add(LSTM((32), batch_input_shape=(None, H, 1), return_sequences=True))
2) The output activation of an LSTM layer is tanh by default, which limits the output to (-1, 1), as you can see in the right plot. I recommend either adding a Dense layer on top or using an LSTM with linear activation on the output. Like this:
model.add(LSTM((1), return_sequences=False, activation='linear'))
Or
model.add(LSTM((16), return_sequences=False))
model.add(Dense(1))
Try the above with 10K samples that you have.
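Putting those two suggestions together, a minimal version of the proposed architecture (assuming H = 8 input timesteps, as in the question) might look like:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

H = 8  # number of input timesteps, as in the question

model = Sequential()
model.add(LSTM(32, batch_input_shape=(None, H, 1), return_sequences=False))
model.add(Dense(1))  # linear head, so the output is not squashed into (-1, 1)
model.compile(loss='mean_squared_error', optimizer='adam')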

Tensorflow: Multiple loss functions vs Multiple training ops

I am creating a Tensorflow model which predicts multiple outputs (with different activations). I think there are two ways to do this:
Method 1: Create multiple loss functions (one for each output), merge them (using tf.reduce_mean or tf.reduce_sum) and pass it to the training op like so:
final_loss = tf.reduce_mean(loss1 + loss2)
train_op = tf.train.AdamOptimizer().minimize(final_loss)
Method 2: Create multiple training operations and then group them like so:
train_op1 = tf.train.AdamOptimizer().minimize(loss1)
train_op2 = tf.train.AdamOptimizer().minimize(loss2)
final_train_op = tf.group(train_op1, train_op2)
My question is whether one method is advantageous over the other? Is there a third method I don't know?
Thanks
I want to make a subtle point that I don't think was made in previous answers.
If you were using something like GradientDescentOptimizer, these would be very similar operations. That's because taking gradients is a linear operation, and the gradient of a sum is the same as the sum of the gradients.
But Adam does something special: regardless of the scale of your loss, it scales the gradients so that they're always on the order of your learning rate. If you multiplied your loss by 1000, it wouldn't affect Adam, because the change would be normalized away.
So, if your two losses are roughly the same magnitude, then it shouldn't make a difference. If one is much larger than the other, then keep in mind that summing before the minimization will essentially ignore the small one, while making two ops will spend equal effort minimizing both.
I personally like dividing them up, which gives you more control over how much to focus on one loss or the other. For example, if it was multi-task learning, and one task was more important to get right than the other, two ops with different learning rates roughly accomplishes this.
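As a sketch of that last point (TF1-style graph code with made-up toy losses, not from the question): separate ops let you pick a different learning rate per loss, whereas summing hands control of the balance to the raw loss scales.
import tensorflow as tf  # TF1-style graph API, matching the question

# Made-up toy losses purely to illustrate the scale issue.
x = tf.placeholder(tf.float32, shape=[None, 4])
w1 = tf.Variable(tf.zeros([4, 1]))
w2 = tf.Variable(tf.zeros([4, 1]))
loss1 = tf.reduce_mean(tf.square(tf.matmul(x, w1) - 1.0))
loss2 = 1000.0 * tf.reduce_mean(tf.square(tf.matmul(x, w2) - 1.0))  # much larger scale

# Summing first: Adam normalizes the combined gradient, so the large loss dominates the direction.
summed_op = tf.train.AdamOptimizer(1e-3).minimize(loss1 + loss2)

# Separate ops: each loss gets its own Adam statistics, and the learning rates weight the tasks explicitly.
train_op1 = tf.train.AdamOptimizer(1e-3).minimize(loss1)
train_op2 = tf.train.AdamOptimizer(1e-4).minimize(loss2)
grouped_op = tf.group(train_op1, train_op2)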
The difference between the two methods is demonstrated clearly in this post on multi-task learning in tensorflow.
In short:
Method 1:
This is called joint training, since it directly adds the losses together, the result is that all the gradients and updates are done with respect to both losses at the same time. Generally this is used when training multiple outputs using the same set of input features.
Method 2:
This creates two separate optimizers and is called alternate training. This is used when you use a subset of input features for each of the outputs. Therefore, when feeding in the feature subset for train_op1, the sub-graph for train_op2 is untouched. Each optimizer can be called in an alternating order using different input features.
If you run both optimizers concurrently with the same input data, then the difference from method 1 is probably very minor.
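A minimal sketch of alternate training (my own toy example in the TF1-style API of the question; each loss depends on a hypothetical subset of input placeholders) would look like:
import numpy as np
import tensorflow as tf  # TF1-style graph API, as in the question

# Hypothetical setup: each loss depends on its own subset of input features.
x_a = tf.placeholder(tf.float32, shape=[None, 3])
x_b = tf.placeholder(tf.float32, shape=[None, 5])
w_a = tf.Variable(tf.random_normal([3, 1]))
w_b = tf.Variable(tf.random_normal([5, 1]))
loss1 = tf.reduce_mean(tf.square(tf.matmul(x_a, w_a)))
loss2 = tf.reduce_mean(tf.square(tf.matmul(x_b, w_b)))

train_op1 = tf.train.AdamOptimizer().minimize(loss1)
train_op2 = tf.train.AdamOptimizer().minimize(loss2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        # Alternate training: each op only needs its own feature subset fed in,
        # and the other sub-graph is left untouched on that step.
        sess.run(train_op1, feed_dict={x_a: np.random.rand(8, 3)})
        sess.run(train_op2, feed_dict={x_b: np.random.rand(8, 5)})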
Method 1 is the correct one, because you define the gradient graph (for computing the backpropagation) only once. In this way, you use a single loss function with a single graph, doing a single update of the same parameters (the update takes into account both terms of the loss).
The second method instead defines 2 different graphs for computing the gradients, and is wrong.
When you execute the training op, you're executing the training operations in parallel (because you used tf.group / tf.tuple / tf.control_dependencies).
The operations will compute two different losses and two different sets of updated variables.
When the moment of updating the variables comes, you have a problem: which update operation executes first, the one defined by the first graph or the other?
And in any case, you're discarding one computation, because one will overwrite the other. There's no synchronization in the updates and no relation between the computed losses.
Both of the methods you suggested are correct; the difference is quite subtle. The main difference is that with the second solution AdamOptimizer keeps a separate gradient accumulator for each loss. Which one works better needs an experiment.
I will showcase how to implement a regression model using Tensorflow's functional API.
In multi-task learning, we need a base network that is shared between tasks and a network head for each individual task:
from tensorflow.keras import layers, models, Model
def create_base_cnn(input_shape):
    model = models.Sequential()
    model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), padding="same", activation="relu", input_shape=input_shape))
    model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), padding="same", activation="relu"))
    # put more layers if you like
    model.add(layers.Flatten())  # flatten the feature maps so the Dense layer yields a (128,) feature vector
    model.add(layers.Dense(128, activation="relu"))
    return model

def create_head(input_shape, name):
    model = models.Sequential(name=name)
    model.add(layers.Dense(128, activation="relu", input_shape=input_shape))
    model.add(layers.Dense(64, activation="relu"))
    # put more layers if you like
    model.add(layers.Dense(1, activation="linear"))
    return model
We can now combine the base model with the heads.
# Create the model.
input_shape = (240, 180, 1)
base_model = create_base_cnn(input_shape)
head_model1 = create_head((128,), name="head1")
head_model2 = create_head((128,), name="head2")
model_input = layers.Input(shape=input_shape)
# Combine base with heads (using TF's functional API)
features = base_model(model_input)
model_output1 = head_model1(features)
model_output2 = head_model2(features)
model = Model(inputs=model_input, outputs=[model_output1, model_output2])
Finally to train the model we can refer to the different outputs by name (in my case: "head1" and "head2"). We can define a hyperparameter for the weight of each head in the loss function:
HEAD1_WEIGHT = 0.4
HEAD2_WEIGHT = 0.6
model.compile(
    optimizer="Adam",
    loss={"head1": "mse", "head2": "mse"},
    loss_weights={"head1": HEAD1_WEIGHT, "head2": HEAD2_WEIGHT},
    metrics={"head1": ["mae"], "head2": ["mae"]}
)
model.fit(dataset_training, validation_data=validation_data, epochs=epochs)
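If you feed NumPy arrays instead of a tf.data dataset, the targets can likewise be passed as a dict keyed by head name; for example, with made-up dummy arrays:
import numpy as np

# Hypothetical dummy data, just to show the shapes and the dict-keyed targets.
x_train = np.random.rand(16, 240, 180, 1).astype("float32")
y1 = np.random.rand(16, 1).astype("float32")
y2 = np.random.rand(16, 1).astype("float32")

model.fit(
    x_train,
    {"head1": y1, "head2": y2},
    validation_split=0.25,
    epochs=2,
    batch_size=4
)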

Training on multiple time-series of various length using recurrent layers in Keras

TL;DR - I have a couple of thousand speed-profiles (time-series where the speed of a car has been sampled) and I am unsure how to configure my models such that I can perform arbitrary forecasting (i.e. predict t+n samples given a sample t).
I have read numerous explanations (1, 2, 3, 4, 5) about how Keras implements statefulness in their recurrent layers, and how one should reset/not reset between iterations, etc..
However, I am unable to acquire the model shape that I want (I think).
As for now, I am only working with a subset of my profiles (denoted as routes in the code below).
Number of training routes: 90
Number of testing routes: 10
The routes vary in length, hence the first thing I do is to iterate through all routes and pad them with 0 so they are all the same length. (I have assumed this is required; if I am wrong, please let me know.) After the padding I convert the routes into a format better suited for the supervised learning task, as described HERE. In this case I have opted to forecast the succeeding 5 steps of the current sample.
The result is a tensor, as:
Shape of training_data: (90, 3186, 6) == (nb_routes, nb_samples/route, nb_timesteps)
which is split into X and y for training as:
Shape of X: (90, 3186, 1)
Shape of y: (90, 3186, 5)
My goal is to have the model take one route at the time and train on it. I have created a model like this:
# Create model
model = Sequential()
# Add recurrent layer
model.add(SimpleRNN(nb_cells, batch_input_shape=(1, X.shape[1], X.shape[2]), stateful=True))
# Add dense layer at the end to acquire correct kind of forecast
model.add(Dense(y.shape[2]))
# Compile model
model.compile(loss="mean_squared_error", optimizer="adam", metrics=["accuracy"])
# Fit model
for _ in range(nb_epochs):
    model.fit(X, y,
              validation_split=0.1,
              epochs=1,
              batch_size=1,
              verbose=1,
              shuffle=False)
    model.reset_states()
This would imply that I have a model with nb_cells recurrent units, that the input of the model is (number_of_samples, number_of_timesteps), i.e. (3186, 1), and that the output of the model is (number_of_lagged_timesteps), i.e. (5).
However, when running the above I get the following error:
ValueError: Error when checking target: expected dense_1 to have 2 dimensions, but got array with shape (90, 3186, 5)
I have tried different ways to solve the above, but I have been unsuccessful.
I have also tried other ways of structuring my data and my model. For instance, merging my routes such that instead of (90, 3186, 6) I had (286740, 6); I simply took the data for each route and put it after the other. After fiddling with my model I got this to run, and I get a result that is quite good, but I really want to understand how this works, and I think the solution I am attempting above is better (if I can get it to work).
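(For reference, and this is my own aside rather than part of the original post: one common way to resolve exactly this shape mismatch is to return the full sequence from the recurrent layer and wrap the Dense layer in TimeDistributed, so the model emits one 5-value forecast per timestep. A sketch, reusing the names nb_cells, X and y from the snippet above:)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, TimeDistributed

model = Sequential()
model.add(SimpleRNN(nb_cells, batch_input_shape=(1, X.shape[1], X.shape[2]),
                    stateful=True, return_sequences=True))
# One 5-step forecast per timestep, so a per-route target of shape (1, 3186, 5) matches.
model.add(TimeDistributed(Dense(y.shape[2])))
model.compile(loss="mean_squared_error", optimizer="adam")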
Update
Note: I am still looking for feedback.
I have reached a "solution" which I think does the trick.
I have abandoned the padding and instead opted for a one-sample-at-a-time approach. The reason is that I am trying to acquire a network that allows me to predict by providing the network with one sample at a time. I want to give the network sample t and have it predict t+1, t+2, ..., t+n, so it is my understanding that I must train the network on one sample at a time. I also assume that using:
stateful will allow me to keep the hidden state of the cells unspoiled between batches (meaning that I can determine the batch size to be len(route))
return_sequences will allow me to get the output vector that I desire
The changed code is given below. Unlike the original question, the shape of the input data is now (90,) (i.e. 90 routes of various lengths), but each training route still has only one feature per sample, and each label has five values per sample (the lagged timesteps).
# Create model
model = Sequential()
# Add nn_type cells
model.add(SimpleRNN(nb_cells, return_sequences=True, stateful=True, batch_input_shape=(1, 1, nb_past_obs)))
# Add dense layer at the end to acquire correct kind of forecast
model.add(Dense(nb_future_obs))
# Compile model
model.compile(loss="mean_squared_error", optimizer="adam", metrics=["accuracy"])
# Fit model
for e in range(nb_epochs):
    for r in range(len(training_data)):
        route = training_data[r]
        for s in range(len(route)):
            X = route[s, :nb_past_obs].reshape(1, 1, nb_past_obs)
            y = route[s, nb_past_obs:].reshape(1, 1, nb_future_obs)
            model.fit(X, y,
                      epochs=1,
                      batch_size=1,
                      verbose=0,
                      shuffle=False)
        model.reset_states()

return model

Solving FizzBuzz with Keras

I am trying to solve FizzBuzz using Keras, and it works quite well for numbers between 1 and 10,000 (90-100% win rate and close to 0 loss). However, if I try even higher numbers, that is, numbers between 1 and 100,000, it doesn't seem to perform well (~50% win rate, loss ~0.3). In fact, it performs quite poorly and I have no clue what I can do to solve this task. So far I am using a very simple neural net architecture with 3 hidden layers:
model = Sequential()
model.add(Dense(2000, input_dim=state_size, activation="relu"))
model.add(Dense(1000, activation="relu"))
model.add(Dense(500, activation="relu"))
model.add(Dense(num_actions, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
I found that the more neurons I have, the better it performs, at least for numbers below 10,000.
I am training my neural net in a step-wise fashion, meaning that I am not computing the inputs and targets beforehand, but instead train the network step by step. Again, this works quite well and it shouldn't make a difference, right? Here's the main loop:
for epoch in range(np_epochs):
    action = random_number()
    x_raw = to_binary(action)
    x = np.expand_dims(x_raw, 0)
    prediction = model.predict(x)
    y, victory, _, _ = check_prediction(action, prediction)
    memory.append((x_raw, y))

    curr_batch_size = min(batch_size, len(memory))
    batch = random.sample(memory, curr_batch_size)

    inputs = []
    targets = []
    for i, t in batch:
        inputs.append(i)
        targets.append(t)

    if victory:
        wins += 1

    loss, accuracy = model.train_on_batch(np.array(inputs), np.array(targets))
As you can see, I am not training my network on decimal numbers; I convert them to binary first before feeding them into the net.
Another thing to mention here is that I am using a memory, to make it more like a supervised problem. I thought it might perform better if it trains on numbers the neural net has already been trained on. It doesn't seem to make any difference at all.
Is there anything I can do to solve this particular problem with a neural net? I mean, is it so hard for a function approximator to figure out the simple math behind FizzBuzz? Am I doing something wrong? Do you suggest a different architecture?
See my code on MachineLabs. You can simply fork my lab and fiddle with it if you want. To view the code, simply click on the 'Editor' tab at the top.
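For concreteness, the binary input encoding and the four-class target described above could look roughly like this (my own sketch with hypothetical helper names, not the actual MachineLabs code):
import numpy as np

NUM_DIGITS = 17  # enough bits for numbers up to ~100,000 (hypothetical choice)

def to_binary(n, num_digits=NUM_DIGITS):
    # Little-endian binary representation of n as a float vector.
    return np.array([(n >> d) & 1 for d in range(num_digits)], dtype=np.float32)

def fizzbuzz_target(n):
    # One-hot over the 4 actions: [print the number, "fizz", "buzz", "fizzbuzz"].
    if n % 15 == 0:
        return np.array([0, 0, 0, 1])
    if n % 5 == 0:
        return np.array([0, 0, 1, 0])
    if n % 3 == 0:
        return np.array([0, 1, 0, 0])
    return np.array([1, 0, 0, 0])

X = np.array([to_binary(n) for n in range(1, 100001)])
Y = np.array([fizzbuzz_target(n) for n in range(1, 100001)])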
