I am implementing a Reinforcement Learning agent that takes actions given a time series of prices. The actions are, classically, buy, sell, or wait. The neural network receives one batch at a time; the window size is 96 steps and I have around 80 features, so the input is roughly 1x96x80. The algorithm is online: every 96 new observations it takes a random sample from a replay memory that stores the last 480 transitions (s, a, r, s'). I give a reward signal for each action at each timestep, where the reward for buy is +1, the one for sell is -1, and so on, so I do not have to bother with exploration. I use the standard way of calculating the loss (as in DeepMind's original DQN paper), with two networks: one estimates the Q values and the other acts as a target and gets a soft update every step.
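For reference, a minimal sketch of the target computation and soft update I am describing (tau and gamma are assumed values, and online_net/target_net stand for the two Keras networks; this is not my actual code):

import numpy as np

TAU = 0.001    # assumed soft-update rate
GAMMA = 0.99   # assumed discount factor

def soft_update(online_net, target_net, tau=TAU):
    # target <- tau * online + (1 - tau) * target, applied weight-by-weight
    new_weights = [tau * w + (1.0 - tau) * wt
                   for w, wt in zip(online_net.get_weights(), target_net.get_weights())]
    target_net.set_weights(new_weights)

def td_targets(rewards, next_states, target_net, gamma=GAMMA):
    # standard DQN target: y = r + gamma * max_a' Q_target(s', a')
    q_next = target_net.predict(next_states, verbose=0)
    return rewards + gamma * np.max(q_next, axis=1)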
The agent picks at each step the action with the highest estimated Q value, as shown in the graphs below. My issue is with how the Q values behave depending on the architecture of the model. In the standard case I have two dense 'elu' layers, two LSTM layers and a final dense 'linear' layer with 3 units. With this configuration the Q values fluctuate too heavily: I get a new maximum almost every step, and the greedy policy switches actions so frequently that it incurs high transaction costs which destroy the performance (fig 1). In the other case (fig 2) I simply add another dense linear layer of 3 units before the last one; now the Q values evolve much more slowly, which improves performance because the agent no longer incurs high costs. The tradeoff is a much less efficient learner that is slow to adapt to new conditions and keeps picking suboptimal actions for longer, which also harms performance (but is still far better than before). For completeness, I tried having the LSTM return the whole sequence and computing the gradient on that, and having it return only the last step; there was no real difference between the two.
Fig. 1: Q values without the second linear layer.
Fig. 2: Q values with the double linear layer.
In both figures the blue line is sell (and keep the short position), the orange is wait (or close the position), and the green is buy (or keep the long position). A position is therefore kept for longer periods when one curve stays consistently above the others.
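For reference, a minimal Keras sketch of the two architectures described above (the layer widths of 64 are placeholders; only the layer types, activations and the 3-unit heads come from my description):

from tensorflow.keras import layers, models

def build_q_net(extra_linear_head=False, window=96, n_features=80, n_actions=3):
    inp = layers.Input(shape=(window, n_features))
    x = layers.Dense(64, activation="elu")(inp)
    x = layers.Dense(64, activation="elu")(x)
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.LSTM(64)(x)                                     # only the last step is returned here
    if extra_linear_head:
        x = layers.Dense(n_actions, activation="linear")(x)    # the extra 3-unit layer of fig 2
    q_values = layers.Dense(n_actions, activation="linear")(x)
    return models.Model(inp, q_values)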
Ideally, I would like a way to tune between these two extremes. However, the second behaviour appears only when I add a second dense layer of 3 units before the last one (it can be linear again, or tanh), so I cannot reach the possibilities in between. Tuning other parameters, such as the discount factor, the learning rate or the LSTM bias, does not really help (increasing the LSTM bias to 5 does help, but only in the first iterations; then it goes back to the same behaviour). Using GRU instead of LSTM also does not significantly change the dynamics of the Q values.
I am at a dead end. Any suggestions? I also do not understand why adding one more simple linear layer slows down the evolution of the Q-value estimates so much.
EDIT: After enough iterations, even the second case (two linear layers) slowly converges to the case where the Q values are far too volatile, so the desired behaviour only lasts a few tens of thousands of steps.
I am training an LSTM to forecast the next value of a time series. Say I have training data of shape (2345, 95) in each of 15 files; that means I have 2345 windows with 50% overlap between them (the time series was divided into windows), and each window has 95 timesteps. I use the following model:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

input1 = Input(shape=(95, 1))                     # one window: 95 timesteps, 1 feature
lstm1 = LSTM(units=100, return_sequences=False,   # only the last hidden state is returned
             activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
model = Model(inputs=input1, outputs=outputs)
model.compile(loss=keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
I feed this data using a generator that passes one whole file at a time, so one epoch has 15 steps. Now my question is: within a given epoch, at a given step, does the LSTM remember the previous window it saw, or is the memory of the LSTM reset after each window? And if it does remember previous windows, is the memory reset only at the end of an epoch?
I have seen similar questions, e.g. TensorFlow: Remember LSTM state for next batch (stateful LSTM) or https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm, but I either did not quite understand the explanation or was unsure whether it covered what I wanted. I am looking for a more technical explanation of where in the LSTM architecture the memory/hidden state is reset.
EDIT:
So from my understanding there are two concepts we can call "memory" here: the weights that are updated through BPTT, and the hidden state of the LSTM cell. For a given window of timesteps the LSTM can remember what the previous timestep was; this is what the hidden state is for, I think. The weight update does not directly reflect memory, if I'm understanding this correctly.
The size of the hidden state, in other words how much the LSTM remembers, is determined by the batch size, which in this case is one whole file. But other questions/answers (https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm and https://stackoverflow.com/a/50235563/13469674) state that if we have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not know that 4 comes after 3 because they are in different windows, even though they belong to the same batch. So I'm still unsure how exactly memory is maintained in the LSTM.
It makes some sense that the hidden state is reset between windows when we look at the LSTM cell diagram. But then the weights are only updated after each step, so where does the hidden state come into play?
What you are describing is called "backpropagation through time" (BPTT); you can google that for tutorials describing the process.
Your concern is justified in one respect and unjustified in another respect.
The LSTM is capable of learning across multiple training iterations (e.g. multiple 15-step intervals). This is because the LSTM state is passed forward from one iteration to the next, which feeds information forward across multiple training iterations.
Your concern is justified in that the model's weights are only updated with respect to the 15 steps (plus whatever batch size you use). As long as 15 steps is long enough for the model to catch valuable patterns, it will generally learn a good set of weights that generalize beyond 15 steps. A good example of this is the character-level Shakespeare model described in Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks".
In summary, the model is learning to create a good hidden state for the next step, averaged over sets of 15 steps as you have defined them. It is common for an LSTM to produce a good, generalized solution by looking at data in these limited segments, akin to batch training but sequential over time.
I might note that 100 is a more typical upper limit for the number of steps in an LSTM. At ~100 steps you start to see a vanishing gradient problem in which the earlier steps contribute nearly nothing to the gradient.
Note that it is important to ensure you are passing the LSTM state forward from training step to training step over the course of an episode (any contiguous sequence). If this step was missed the model would certainly suffer.
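As a sketch of what "passing the state forward" can look like in Keras (this is not from the question's code; stateful=True and the fixed batch size are the key additions):

from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dense

inp = keras.Input(shape=(95, 1), batch_size=1)          # stateful LSTMs need a fixed batch size
lstm_out = LSTM(100, stateful=True, activation="tanh")(inp)
outputs = Dense(1, activation="sigmoid")(lstm_out)
model = keras.Model(inputs=inp, outputs=outputs)
model.compile(loss="binary_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=0.01))

# Feed the windows of one file in order; the hidden/cell state then carries over
# from window to window until it is explicitly reset at the file boundary:
# for windows, targets in file_generator():             # hypothetical generator, one file per yield
#     model.fit(windows, targets, batch_size=1, shuffle=False, epochs=1)
#     model.reset_states()                              # reset at the episode/file boundary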
I was trying to implement a paper I read. Basically, it uses three neural-network classifiers with different parameters on the same loan-default data, with 9 different training-to-testing ratios.
To find the best parameters, the following criterion is used: with (1) max_iteration = 25000 and (2) a loss value below 0.008, we measure the accuracy and pick the best.
However, when I try to do this with sklearn.neural_network.MLPClassifier in Python, I hit a problem: as the training-to-test ratio increases, the number of iterations the program runs drops dramatically, while the loss value increases.
Classifier Performance Table.
This is clearly not what I want: the iteration count should keep rising to 25000 before stopping.
This is how I defined the classifiers:
from sklearn.neural_network import MLPClassifier

clf1 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False, alpha=1e-5,
                     max_iter=25000, hidden_layer_sizes=(18,), momentum=0.7,
                     learning_rate_init=0.0081, tol=0, random_state=3)
clf2 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False, alpha=1e-5,
                     max_iter=25000, hidden_layer_sizes=(23,), momentum=0.69,
                     learning_rate_init=0.0095, tol=0, random_state=3)
clf3 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False, alpha=1e-5,
                     max_iter=25000, hidden_layer_sizes=(27,), momentum=0.79,
                     learning_rate_init=0.0075, tol=0, random_state=3)
As you can see, I already set tol=0, so as long as the loss keeps decreasing at all, training should not stop early. I have also tried other values, but the iteration count is still far smaller than I expected.
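For reference, this is roughly how I check the iteration count and loss after fitting (X_train and y_train stand for one of the training-to-test splits; n_iter_, loss_ and loss_curve_ are the fitted attributes sklearn exposes):

clf1.fit(X_train, y_train)
print("iterations actually run:", clf1.n_iter_)     # far below 25000
print("final loss:", clf1.loss_)
print("recorded loss values:", len(clf1.loss_curve_))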
Hope someone can help me, thanks!
The question is about vanilla, non-batched reinforcement learning. Basically what is defined here in Sutton's book.
My model trains (woohoo!), though there is an element that confuses me.
Background:
In an environment where duration is rewarded (like pole-balancing), we have rewards of (say) 1 per step. After an episode, before sending this array of 1's to the train step, we do the standard discounting and normalization to get returns:
returns = self.discount_rewards(rewards)
returns = (returns - np.mean(returns)) / (np.std(returns) + 1e-10)  # usual normalization
The discount_rewards method is the usual one, but here is the gist if you're curious.
So an array of rewards [1,1,1,1,1,1,1,1,1] becomes an array of returns [1.539, 1.160, 0.777, 0.392, 0.006, -0.382, -0.773, -1.164, -1.556].
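For reference, a minimal sketch of the discounting plus normalization (the linked gist is not reproduced here; gamma = 0.995 is an assumption that roughly reproduces the numbers quoted above):

import numpy as np

def discount_rewards(rewards, gamma=0.995):   # gamma is assumed, not taken from the gist
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        returns[t] = running
    return returns

rewards = np.ones(9)
returns = discount_rewards(rewards)
returns = (returns - np.mean(returns)) / (np.std(returns) + 1e-10)
print(returns)   # ~[ 1.54  1.16  0.78  0.39  0.01 -0.38 -0.77 -1.16 -1.56]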
Given that basic background I can ask my question:
If positive returns are enforced, and negative returns are discouraged (in the optimize step), then no matter the length of the episode, roughly the first half of the actions will be encouraged, and the latter half will be discouraged. Is that true, or am I misunderstanding something?
If it's NOT true, I would love to understand what I got wrong.
If it IS true, then I don't understand why the model trains, since even a good-performing episode will have the latter half of its actions discouraged.
To reiterate, this is non-batched learning (so the returns are not relative to returns in another episode in the training step). After each episode, the model trains, and again, it trains well :)
Hoping this makes sense, and is short enough to feel like a proper clear question.
Background
Yes, positive rewards are better than negative rewards
No, positive rewards are not good on an absolute scale
No, negative rewards are not bad on an absolute scale
If you increase or decrease all rewards (good and bad) equally, nothing changes really.
The optimizer tries to minimize the loss (maximize the reward), which means it is interested only in the delta between values (the gradient), not in their absolute value or their sign.
Reinforcement Learning
Let's say your graph looks something like this:
...
logits = ...  # raw (pre-softmax) network outputs; softmax_cross_entropy applies the softmax itself
labels = tf.one_hot(q_actions, n_actions)
loss = tf.losses.softmax_cross_entropy(labels, logits, weights=q_rewards)
The losses for the individual "classes" get scaled by weights which in this case are q_rewards:
loss[i] = -q_rewards[i] * tf.log( tf.nn.softmax( logits[i] ) )
The loss is a linear function of the reward, and the gradient stays monotonic under a linear transformation.
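As a small numeric stand-in for the graph above (plain numpy, not TensorFlow), this shows the linearity: doubling the reward weight exactly doubles the per-step loss.

import numpy as np

def weighted_ce(logits, action, reward):
    p = np.exp(logits) / np.exp(logits).sum()   # softmax over the action logits
    return -reward * np.log(p[action])          # -q_rewards[i] * log(softmax(logits[i])[action])

logits = np.array([0.5, 1.0, -0.3])
print(weighted_ce(logits, action=1, reward=1.0))   # some loss L
print(weighted_ce(logits, action=1, reward=2.0))   # exactly 2 * L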
Reward Normalization
doesn't mess with the sign of the gradient
makes the gradient steeper for rewards far from the mean
makes the gradient shallower for rewards near the mean
When the agent performs rather badly, it receives many more bad rewards than good rewards. Normalization makes the gradient steeper for (puts more weight on) the good rewards and shallower for (puts less weight on) the bad rewards.
When the agent performs rather well, it's the other way around.
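A concrete example of that effect, with 8 bad rewards and 2 good ones:

import numpy as np

rewards = np.array([-1.0] * 8 + [1.0] * 2)
normed = (rewards - rewards.mean()) / (rewards.std() + 1e-10)
print(normed)   # the 8 bad steps each become -0.5, the 2 good steps each become +2.0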
Your questions
If positive returns are enforced, and negative returns are discouraged (in the optimize step) ...
It's not the sign or the absolute value that matters, but the delta (the relative values).
... then no matter the length of the episode, roughly the first half of the actions will be encouraged, and the latter half will be discouraged.
If there are either much more high or much more low reward values, then you have a smaller half with a steeper gradient (more weight) and a larger half with a shallower gradient (less weight).
If it IS true, then I don't understand why the model trains, since even a good-performing episode will have the latter half of its actions discouraged.
Your loss value is actually expected to stay about constant at some point. So you have to measure your progress by running the program and looking at the (un-normalized) rewards.
For reference, see the example network from Google IO:
github.com/GoogleCloudPlatform/tensorflow-without-a-phd/.../tensorflow-rl-pong/... and search for _rollout_reward
This isn't a bad thing, however. It's just that your loss is (more or less) "normalized" as well. But the network keeps improving anyway by looking at the gradient at each training step.
Classification problems usually have a "global" loss which keeps falling over time. Some optimizers keep a history of the gradient to adapt the learning rate (effectively scaling the gradient) which means that internally, they also kinda "normalize" the gradient and thus don't care if we do either.
If you want to learn more about behind-the-scenes gradient scaling, I suggest taking a look at ruder.io/optimizing-gradient-descent
To reiterate, this is non-batched learning (so the returns are not relative to returns in another episode in the training step). After each episode, the model trains, and again, it trains well :)
The larger your batch size, the more stable your distribution of rewards, the more reliable the normalization. You could even normalize rewards across multiple episodes.
In my opinion, the accepted answer is wrong.
I read it, and I thought it was plausible, and then I stopped worrying about gradient normalization and checked something else. Only much later did I notice that it was precisely the gradient normalization breaking my training process.
First off, "Reward Normalization doesn't mess with the sign of the gradient" is just plain wrong.
returns = (returns - np.mean(returns)) / (np.std(returns) + 1e-10)
Obviously, if you subtract the mean, that'll flip some signs. So yes, reward normalization does affect the sign of the gradient.
Second, tf.losses.softmax_cross_entropy is, in everyday words, a measurement of how many plausible options the AI had when choosing what it did. Select 1 out of 10 actions randomly? Your cross-entropy is very high. Always select the exact same item? Your cross-entropy is low, because the other choices are irrelevant if you statistically never take them.
In line with that, what
loss[i] = -q_rewards[i] * tf.log( tf.nn.softmax( logits[i] ) )
actually does is this:
If your reward is positive, it will minimize the cross-entropy, meaning it will increase the chance that the AI will take the exact same action again when it sees a similar situation in the future.
If your reward is negative, it will maximize the cross-entropy, meaning it will make the AI choose more randomly when it sees a similar situation in the future.
And that's the purpose of reward normalization: yes, after normalization, roughly half of the items in your trajectory have a positive sign and the other half have a negative sign. What you are basically saying is: do more of the things that worked, and try something more random in place of the things that didn't.
And that leads to very actionable advice:
If your model is behaving too randomly, make sure you have enough positive rewards (after normalization).
If your model is always doing the same and not exploring, make sure you have enough negative rewards (after normalization).
I am trying to understand the concept of output_keep_prob. Say my example is a simple RNN:
import tensorflow as tf
from tensorflow.contrib import rnn

with tf.variable_scope('encoder') as scope:
    cells = rnn.LSTMCell(num_units=500)
    # wrap the cell so that each output element is kept with probability 0.5 during training
    cell = rnn.DropoutWrapper(cell=cells, output_keep_prob=0.5)
    model = tf.nn.bidirectional_dynamic_rnn(cell, cell, inputs=embedding_lookup,
                                            sequence_length=sequence_le,
                                            dtype=tf.float32)
My confusion is: if I give output_keep_prob=0.5, what does it actually mean? I know dropout makes the network less prone to overfitting (it acts as regularization) by randomly turning off neuron activations during training; I got that point. But I am confused about what happens when I set output_keep_prob=0.5 with no_of_nodes = 500. Does 0.5 mean that it will
(a) keep only those nodes whose probability is >= 0.5, or
(b) randomly turn off 50% of the node units at each iteration?
I tried to understand the concept from this Stack Overflow answer, but I have the same confusion about what 0.5 actually means: should it drop 50% of the nodes at each iteration, or keep only those nodes that have a probability of at least 0.5? If the answer is the latter (keep only the nodes with probability >= 0.5), then suppose I have 500 node units and only 30 nodes have probability 0.5; does that mean the remaining 470 nodes are turned off and only those 30 nodes are used for incoming and outgoing connections?
Because this answer says:
"Suppose you have 10 units in the layer and set the keep_prob to 0.1. Then the activation of 9 randomly chosen units out of 10 will be set to 0, and the remaining one will be scaled by a factor of 10. I think a more precise description is that you only keep the activation of 10 percent of the nodes."
While on the other side, this answer by mrry says:
"it means that each connection between layers (in this case between the last densely connected layer and the readout layer) will only be used with probability 0.5 when training."
Can anyone give a clear explanation of which one is correct and what this keep_prob value actually represents?
keep_prob means the probability that any given neuron's output is preserved (as opposed to dropped, that is, zeroed out). In other words, keep_prob = 1 - drop_prob.
The tf.nn.dropout() description states that
By default, each element is kept or dropped independently.
So if you think about it, if you have a large number of neurons, say 10,000 in a layer, and keep_prob is 0.3, then 3,000 is the expected number of neurons kept. It's therefore more or less the same thing to say that a keep_prob of 0.3 means keeping the value of 3,000 randomly chosen ones out of the 10,000 neurons, but not exactly, because the actual number may vary a bit from 3,000.
Scaling comes into the picture because if you drop a certain number of neurons, then the expected sum of the layer will be reduced. So the remaining ones are multiplied to feed forward the same magnitude of values as they would otherwise. This is especially important if you load a pretrained network and want to continue training but with a different keep_prob value now.
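A minimal numpy sketch of both points (inverted dropout, which is what tf.nn.dropout implements):

import numpy as np

def dropout(activations, keep_prob):
    # each element is kept independently with probability keep_prob ...
    mask = np.random.rand(*activations.shape) < keep_prob
    # ... and the survivors are scaled by 1/keep_prob to preserve the expected sum
    return activations * mask / keep_prob

layer = np.ones(10000)
dropped = dropout(layer, keep_prob=0.3)
print((dropped != 0).sum())   # roughly 3000 kept, a different random subset on every call
print(dropped.sum())          # roughly 10000, thanks to the 1/keep_prob scaling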
(Please note that you can introduce non-independence into the drop probabilities with the noise_shape argument; see the tf.nn.dropout() description, but that is outside the scope of this question.)
The random decision to drop a neuron or not is recalculated for each invocation of the network, so you will have a different set of neurons dropped on every iteration. The idea behind dropout is that subsequent layers cannot overfit and learn to watch for arbitrary constellations of certain activations. You ruin the "secret plan of lazy neurons to overfit" by always changing which previous activations are available.
An intuitive explanation for the effectiveness of dropout might be as follows.
Imagine that you have a team of workers and the overall goal is to learn how to erect a building. When each of the workers is overly specialized, if one gets sick or makes a mistake, the whole building will be severely affected. The solution proposed by the "dropout" technique is to randomly pick some of the workers every week and send them on a business trip. The hope is that the team overall still learns how to build the building and is thus more resilient to noise or to workers being on vacation.
Because of its simplicity and effectiveness, dropout is used today in various architectures, usually immediately after fully connected layers.
Generalization in machine learning refers to how well the concepts learned by the model apply to examples not seen during training. The goal of most machine learning models is to generalize well from the training data in order to make good predictions on unseen data in the future. Overfitting happens when the model learns the details and noise of the training data too well but does not generalize, so performance is poor on test data. It is a very common problem when the dataset is too small compared with the number of model parameters that need to be learned, and it is particularly acute in deep neural networks, where it is not uncommon to have millions of parameters.
Here is a visual explanation I found:
More info here.
I was just wondering if anyone else has encountered this behavior before; I couldn't find this specific issue in my searches. I wrote my own (very simple) neural net using just numpy, and it appears to work: the cost function decreases as I iterate. However, when I change the random initialization of the weights from np.random.randn(shape) to np.random.randn(shape)*0.01 (I heard that smaller initial weights might speed up learning because I'm using a sigmoid layer), my cost function starts at about 0.69 ≈ ln(2) and pretty much gets stuck there. This happens no matter how many times I restart the neural net and no matter what inputs I feed it, which I find very odd. I should add that if I start without the 0.01 multiplication factor, the cost function decreases to below 0.69.
The neural net uses a cross-entropy cost function and gradient descent; no regularization has been implemented. The behavior doesn't seem to depend on the network's dimensions (number of layers, neurons per layer) or on the learning rate, only on whether I initialize the starting weights with or without the 0.01 factor.
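For what it's worth, a quick numpy check (the shapes below are arbitrary stand-ins, not my actual network) reproduces that starting value: with the 0.01 scaling, the sigmoid outputs all sit very close to 0.5, and the cross-entropy of a 0.5 prediction is ln(2) ≈ 0.693.

import numpy as np

np.random.seed(0)
W = 0.01 * np.random.randn(20, 1)        # small initial weights, as described above
X = np.random.randn(100, 20)             # arbitrary inputs
y = np.random.randint(0, 2, size=(100, 1))

p = 1.0 / (1.0 + np.exp(-X @ W))         # sigmoid outputs, all very close to 0.5
cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(p.min(), p.max())                  # both near 0.5
print(cost, np.log(2))                   # both approximately 0.693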