This question comes from watching the following video on TensorFlow and Reinforcement Learning from Google I/O 18: https://www.youtube.com/watch?v=t1A3NTttvBA
Here they train a very simple RL algorithm to play the game of Pong.
In the slides they use, the loss is defined like this (at approximately 11m 25s):
loss = -R(sampled_actions * log(action_probabilities))
Further on, they show the following code (at approximately 20m 26s):
# loss
cross_entropies = tf.losses.softmax_cross_entropy(
    onehot_labels=tf.one_hot(actions, 3), logits=Ylogits)
loss = tf.reduce_sum(rewards * cross_entropies)
# training operation
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.99)
train_op = optimizer.minimize(loss)
Now my question is this: they use +1 for winning and -1 for losing as rewards. In the code provided, any cross-entropy loss that is multiplied by a negative reward will be very low, won't it? And if the training operation uses the optimizer to minimize the loss, isn't the algorithm then trained to lose?
Or is there something fundamental I'm missing (probably because of my very limited mathematical skills)?
Great question, Corey. I am also wondering exactly what this popular loss function in RL actually means. I've seen many implementations of it, and many contradict each other. As I understand it, it means this:
Loss = - log(pi) * A
Where A is the advantage compared to a baseline. In Google's case they used a baseline of 0, so A = R. This is multiplied by the log-probability of the specific action taken at that specific time. In your example above, if the action is one-hot encoded as [1, 0, 0], we ignore the 0s and keep only the entry for the 1. Hence the above equation.
If you intuitively calculate this loss for a negative reward:
Loss = - (-1) * log(P)
But for any P less than 1, log of that value will be negative. Therefore, you have a negative loss which can be interpreted as "very good", but really doesn't make physical sense.
The correct way:
However, in my opinion (others, please correct me if I'm wrong), the value of the loss itself is not what matters. What drives training is the gradient of the loss, that is, the derivative of -log(pi) * A.
Therefore, you would have:
-(d(pi) / pi) * A
Now a large negative reward translates into a large gradient, and a gradient-descent step along it pushes the probability of that action down.
I hope this makes sense.
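To make the gradient argument concrete, here is a minimal plain-Python sketch (not the code from the talk). It assumes a 3-action softmax policy with hypothetical starting logits and a hypothetical learning rate, and relies on the standard result that the gradient of reward * cross_entropy with respect to the logits is reward * (probs - one_hot).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(logits, action, reward, lr=0.5):
    """One gradient-descent step on loss = reward * cross_entropy(action, logits)."""
    probs = softmax(logits)
    # d(cross_entropy)/d(logit_k) = probs[k] - one_hot[k]
    grads = [reward * (p - (1.0 if k == action else 0.0))
             for k, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]

logits = [0.0, 0.0, 0.0]   # hypothetical starting policy: uniform over 3 actions
action = 0                 # the action that was sampled

p_before = softmax(logits)[action]
p_after_win = softmax(step(logits, action, reward=+1.0))[action]
p_after_loss = softmax(step(logits, action, reward=-1.0))[action]

print(p_before, p_after_win, p_after_loss)
```

With reward +1 the probability of the sampled action goes up after the step, and with reward -1 it goes down. So minimizing the negative-reward term makes the losing action less likely; it does not train the agent to lose.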
This is my first post, so I will try to detail all the relevant info. If anything is missing, please let me know!
I am currently trying to create a CNN (based on U-Net) for image segmentation on grayscale images.
I have created a custom function to calculate a combined Dice loss and binary cross-entropy loss, see below.
def dice_BCE_coef_loss(y_true, y_pred):
    smooth = 1
    bce_weight = 0.5
    #y_true_f = tensorflow.math.reduce_sum(y_true)
    #y_pred_f = tensorflow.math.reduce_sum(y_pred)
    y_true_f = tensorflow.reshape(y_true, [-1])
    y_pred_f = tensorflow.reshape(y_pred, [-1])
    intersection = tensorflow.math.reduce_sum(y_true_f * y_pred_f)
    union = tensorflow.math.reduce_sum(y_true_f + y_pred_f)
    dice_coef = (2*intersection + smooth) / (union + smooth)
    dice_loss = 1 - dice_coef
    BCE = tensorflow.keras.losses.BinaryCrossentropy(from_logits=True)(y_true, y_pred)
    dice_BCE = tensorflow.math.reduce_mean(BCE * bce_weight + dice_loss * (1 - bce_weight))
    return dice_BCE
I then add this to my model as the loss.
model.compile(optimizer=tensorflow.keras.optimizers.Adam(lr=1e-3),
              loss=dice_BCE_coef_loss,
              metrics=['accuracy']
              )
The issue is that when I calculate dice_BCE manually, the value differs from the loss reported during training. To check whether the reported value was an average across the whole dataset (my manual check was on a single image), I reduced my dataset to a single image and mask, yet the values still didn't match.
Image showing the discrepancy between the reported loss and my expected dice_BCE loss (I hope a picture is allowed in this case).
The reported loss stays around 0.48 after several epochs and never really improves from there. Sometimes the output mask is really close (with a good expected dice_BCE to match), yet training ends up diverging, apparently because the loss it trains on can be improved in other ways that increase the expected dice loss.
The dice loss (judging by the loss value of the epoch) is also far lower than when calculated through the function: around 0.001, even when the accuracy of the prediction is ~50% and the mask visibly looks incorrect.
Can anybody explain how this loss is calculated and why it does not match what I expect it to be?
I have read through similar posts on here but can't find anything useful.
If this is obvious, any suggestions on what to look into next or resources to investigate further would be welcome. Thank you in advance!
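One thing that may be worth checking by hand is how the from_logits=True flag interprets y_pred. The plain-Python sketch below is only a hypothesis, not a diagnosis: the post does not say whether the model's last layer already applies a sigmoid. It assumes hypothetical prediction values that are already probabilities, and compares the BCE computed on them directly with the BCE obtained when the same values are treated as logits (passed through a sigmoid first, which is what from_logits=True implies). The Dice term mirrors the formula in the function above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def dice_loss(y_true, y_prob, smooth=1.0):
    """1 - Dice coefficient, using the same smoothing as the post."""
    intersection = sum(t * p for t, p in zip(y_true, y_prob))
    union = sum(y_true) + sum(y_prob)
    return 1.0 - (2.0 * intersection + smooth) / (union + smooth)

y_true = [1.0, 0.0, 1.0, 0.0]
y_pred = [0.9, 0.1, 0.8, 0.2]   # hypothetical values, already probabilities

bce_as_probs = bce(y_true, y_pred)                         # values used as-is
bce_as_logits = bce(y_true, [sigmoid(v) for v in y_pred])  # values treated as logits
dice = dice_loss(y_true, y_pred)

print(bce_as_probs, bce_as_logits, dice)
```

If the two BCE values differ this much on a toy example, a manual check that treats the network output as probabilities will not match a training loss computed with from_logits=True on the same output.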
I'm currently training a WGAN in keras with (approx) Wasserstein loss as below:
def wasserstein_loss(y_true, y_pred):
    return K.mean(y_true * y_pred)
However, this loss can obviously be negative, which is weird to me.
I trained the WGAN for 200 epochs and got the critic Wasserstein loss training curve below.
The above loss is calculated by
d_loss_valid = critic.train_on_batch(real, np.ones((batch_size, 1)))
d_loss_fake = critic.train_on_batch(fake, -np.ones((batch_size, 1)))
d_loss, _ = 0.5*np.add(d_loss_valid, d_loss_fake)
The generated sample quality is great, so I think I trained the WGAN correctly. However, I still cannot understand why the Wasserstein loss can be negative while the model still works. According to the original WGAN paper, the Wasserstein loss can be used as a performance indicator for a GAN, so how should we interpret it? Am I misunderstanding anything?
The Wasserstein loss is a measurement of the Earth Mover's distance, which is a distance between two probability distributions. In TensorFlow it is commonly implemented as d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real), which can obviously give a negative number if the d_fake distribution moves too far to the other side of the d_real distribution. You can see this on your plot: during training the real and fake distributions keep changing sides until they converge around zero. So as a performance measurement you can use it to see how far the generator is from the real data and on which side it currently is.
See the distributions plot:
P.S. it's crossentropy loss, not Wasserstein.
Perhaps this article can help you more, if you haven't read it yet. However, the other question is how the optimizer can minimize a negative loss (towards zero).
It looks like I cannot comment on the answer given by Sergeiy Isakov because I do not have enough reputation. I wanted to comment because I think that information is not correct.
In principle, the Wasserstein distance cannot be negative, because a distance metric cannot be negative. The actual expression (dual form) for the Wasserstein distance involves the supremum over all 1-Lipschitz functions (you can look it up on the web). Since it is a supremum, we always take the Lipschitz function that gives the largest value to obtain the Wasserstein distance. However, the value we compute with a WGAN is just an estimate and not the true Wasserstein distance. If the critic gets too few inner iterations, it may not have had enough updates to move the estimate to a positive value.
Thought experiment: suppose we obtain a Wasserstein estimate that is negative. We can always negate the critic function (the negation of a 1-Lipschitz function is still 1-Lipschitz) to make the estimate positive. That means there exists a Lipschitz function that gives a positive value, which is larger than the value given by the Lipschitz function that produced the negative estimate. So the Wasserstein distance itself cannot be negative, since by definition it is the supremum over all 1-Lipschitz functions.
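The thought experiment can be sketched in a few lines of plain Python. The samples and the critic below are hypothetical 1-D toy values chosen for illustration; a real WGAN critic is a neural network, and the estimate here is simply the difference of the critic's mean outputs on real and fake samples.

```python
def wasserstein_estimate(critic, real, fake):
    """Estimate E[f(real)] - E[f(fake)] for a given critic function f."""
    mean_real = sum(critic(x) for x in real) / len(real)
    mean_fake = sum(critic(x) for x in fake) / len(fake)
    return mean_real - mean_fake

real = [1.0, 1.2, 0.8]      # hypothetical 1-D "real" samples
fake = [2.0, 2.1, 1.9]      # hypothetical 1-D "generated" samples

def critic(x):
    return x                # f(x) = x is 1-Lipschitz

def negated_critic(x):
    return -critic(x)       # -f is 1-Lipschitz as well

estimate = wasserstein_estimate(critic, real, fake)
flipped = wasserstein_estimate(negated_critic, real, fake)
best = max(estimate, flipped)   # a supremum would pick at least this value

print(estimate, flipped, best)
```

The first critic yields a negative estimate, its negation yields the same magnitude with a positive sign, and a supremum over all admissible critics can never be smaller than the positive one.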
I've written an LSTM in Keras for univariate time series forecasting. I'm using an input window of size 48 and an output window of size 12, i.e. I'm predicting 12 steps at once. This is working generally well with an optimization metric such as RMSE.
For non-stationary time series I'm differencing the data before feeding the data to the LSTM. Then after predicting, I take the inverse difference of the predictions.
When differencing, RMSE is not suitable as an optimization metric as the earlier prediction steps are a lot more important than later steps. When we do the inverse difference after creating a 12-step forecast, then the earlier (differenced) prediction steps are going to affect the inverse difference of later steps.
So what I think I need is an optimization metric that gives the early prediction steps more weight, preferably exponentially.
Does such a metric exist already or should I write my own? Am I overlooking something?
I just wrote my own optimization metric; it seems to work well, certainly better than RMSE.
Still curious what's the best practice here. I'm relatively new to forecasting.
from tensorflow.keras import backend as K
def weighted_rmse(y_true, y_pred):
    weights = K.arange(start=y_pred.get_shape()[1], stop=0, step=-1, dtype='float32')
    y_true_w = y_true * weights
    y_pred_w = y_pred * weights
    return K.sqrt(K.mean(K.square(y_true_w - y_pred_w), axis=-1))
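Since the question asked for weights that decay exponentially, here is a plain-Python sketch of that variant for comparison. It is not the Keras function above: the decay parameter is a hypothetical choice, the weights are applied to the squared errors (rather than to the values before subtracting), and they are normalized to sum to 1.

```python
import math

def exp_weighted_rmse(y_true, y_pred, decay=0.8):
    """RMSE in which forecast step t gets weight decay**t (normalized)."""
    weights = [decay ** t for t in range(len(y_true))]
    total = sum(weights)
    mse = sum(w * (a - b) ** 2
              for w, a, b in zip(weights, y_true, y_pred)) / total
    return math.sqrt(mse)

y_true = [0.0, 0.0, 0.0, 0.0]
early_error = [1.0, 0.0, 0.0, 0.0]   # same-size error on the first step
late_error = [0.0, 0.0, 0.0, 1.0]    # same-size error on the last step

print(exp_weighted_rmse(y_true, early_error))
print(exp_weighted_rmse(y_true, late_error))
```

An error of the same size is penalized more when it occurs on an early step than on a late one, which is the behaviour wanted for differenced forecasts.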
I can't wrap my head around this question: how exactly do negative rewards help the machine avoid them?
The question originates from Google's solution for the game Pong. By their logic, once a game is finished (the agent won or lost a point), the environment returns a reward (+1 or -1). Any intermediate state returns 0 as the reward. That means each win/loss will return a reward array of either [0,0,0,...,0,1] or [0,0,0,...,0,-1]. Then they discount and standardize the rewards:
#rwd - array with rewards (ex. [0,0,0,0,0,0,1]), args.gamma is 0.99
prwd = discount_rewards(rwd, args.gamma)
prwd -= np.mean(prwd)
prwd /= np.std(prwd)
discount_rewards is supposed to be some kind of standard function; the implementation can be found here. The result for a win (+1) could be something like this:
[-1.487 , -0.999, -0.507, -0.010, 0.492, 0.999, 1.512]
For a loss (-1):
[1.487 , 0.999, 0.507, 0.010, -0.492, -0.999, -1.512]
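For readers without the linked implementation, here is a plain-Python sketch of what such a discount_rewards function commonly looks like, followed by the same standardization as above. This is an assumption, not the code from the linked source: in particular, resetting the running sum at every non-zero reward is a Pong-specific convention, and the standardize helper is a hypothetical name that mirrors the np.mean / np.std lines.

```python
import math

def discount_rewards(rewards, gamma):
    """Discounted return at each step, computed backwards in time.

    The running sum is reset whenever a non-zero reward is seen,
    which marks the end of a point in Pong (an assumed convention).
    """
    discounted = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

won = standardize(discount_rewards([0, 0, 0, 0, 0, 0, 1], 0.99))
lost = standardize(discount_rewards([0, 0, 0, 0, 0, 0, -1], 0.99))

print([round(v, 3) for v in won])
print([round(v, 3) for v in lost])
```

For these inputs the results come out close to the arrays quoted above: the moves nearest to the point get the largest magnitude, with the sign flipped between a win and a loss.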
As a result, each move gets rewarded. Their loss function looks like this:
loss = tf.reduce_sum(processed_rewards * cross_entropies + move_cost)
Please help me answer the following questions:
1. The cross-entropy function can produce outputs from 0 to inf. Right?
2. The TensorFlow optimizer minimizes the loss by absolute value (it doesn't care about the sign; the perfect loss is always 0). Right?
3. If statement 2 is correct, then a loss of 7.234 is equally as bad as -7.234. Right?
4. If everything above is correct, then how does a negative reward tell the machine that it's bad, and a positive one tell it that it's good?
I also read this answer; however, I still don't understand exactly why negative is worse than positive. It makes more sense to me to have something like:
loss = tf.reduce_sum(tf.pow(cross_entropies, reward))
But that experiment didn't go well.
"The cross-entropy function can produce outputs from 0 to inf. Right?"
Yes, but only because we multiply it by -1. Consider the natural sign of log(p): as p is a probability (i.e. between 0 and 1), log(p) lies in (-inf, 0].
"The TensorFlow optimizer minimizes the loss by absolute value (it doesn't care about the sign; the perfect loss is always 0). Right?"
Nope, the sign matters. It sums up all losses with their signs intact.
"If statement 2 is correct, then a loss of 7.234 is equally as bad as -7.234. Right?"
See below: a loss of 7.234 is much better than a loss of -7.234 in terms of increasing the reward. An overall positive loss indicates that our agent is making a series of good decisions.
"If everything above is correct, then how does a negative reward tell the machine that it's bad, and a positive one tell it that it's good?"
The post "Normalizing Rewards to Generate Returns in reinforcement learning" makes a very good point: the signed rewards are there to control the size of the gradient. The positive and negative rewards perform a "balancing" act for the gradient size, because a huge gradient from a large loss would cause a large change to the weights. Thus, if your agent makes as many mistakes as it does proper moves, the overall update for that batch should not be large.
"The TensorFlow optimizer minimizes the loss by absolute value (it doesn't care about the sign; the perfect loss is always 0). Right?"
Wrong. Minimizing the loss means trying to achieve as small a value as possible. That is, -100 is "better" than 0. Accordingly, -7.2 is better than 7.2. Thus, a value of 0 really carries no special significance, besides the fact that many loss functions are set up such that 0 determines the "optimal" value. However, these loss functions are usually set up to be non-negative, so the question of positive vs. negative values doesn't arise. Examples are cross entropy, squared error etc.
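A tiny plain-Python sketch illustrates the point. The loss function and step size are hypothetical, chosen only so that the minimum lies below zero; plain gradient descent stands in for the TensorFlow optimizer.

```python
def loss(x):
    return x * x - 5.0      # minimum value is -5, reached at x = 0

def grad(x):
    return 2.0 * x          # derivative of the loss above

x = 3.0                     # the loss starts at +4
start_loss = loss(x)
for _ in range(100):
    x -= 0.1 * grad(x)      # plain gradient descent, hypothetical step size

final_loss = loss(x)
print(start_loss, final_loss)
```

The loss starts positive, passes through 0 without stopping, and settles near -5. The optimizer follows the gradient downhill and attaches no meaning to the value 0.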
I want to implement an accuracy function for a triplet loss network so that I know how the algorithm is performing during training. So far I have tried something, but I'm not sure whether it can actually work, and I also have trouble implementing it in Keras. My idea was to compare the predicted anchor-positive and anchor-negative distances (in y_pred), so that the positive distance should be low enough and the negative one large enough:
def accuracy(_, y_pred):
    pos_treshold = 0.4
    neg_treshold = 0.6
    return K.mean(y_pred[0] < pos_treshold and y_pred[1] > neg_treshold)
The problem is that I couldn't figure out how to implement this "and" condition in Keras.
Then I tried to find something on the topic of accuracy for triplet loss. One way of doing it is to define the accuracy as the proportion of triplets in which the predicted distance between the anchor image and the positive image is less than the one between the anchor image and the negative image. With this I have even bigger problems implementing it in Keras.
I tried this (although I don't know whether it does what I described):
K.mean(y_pred[0] < y_pred[1])
which gives me an accuracy of around 0.5 all the time (probably something random). So I still don't know whether the model is bad or the accuracy function is bad.
So my question is: how do I implement any reasonable accuracy function in Keras? I don't really care which of these two it is.
That's what I use (the condition y_pred[0] < y_pred[1]), while taking the batch dimension into account. Note that I'm not using a mean, so that it supports sample weights.
def triplet_accuracy(_, y_pred):
    '''
    Input: y_pred shape is (batch_size, 2)
    [pos, neg]
    Output: shape (batch_size, 1)
    loss[i] = 1 if y_pred[i, 0] < y_pred[i, 1] else 0
    '''
    subtraction = K.constant([-1, 1], shape=(2, 1))
    diff = K.dot(y_pred, subtraction)
    loss = K.maximum(K.sign(diff), K.constant(0))
    return loss
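To check what this metric computes without running Keras, the same per-row logic can be written in plain Python. The helper name and the batch of (pos, neg) distances below are hypothetical, used only for illustration.

```python
def triplet_accuracy_rows(y_pred):
    """Per-triplet score: 1.0 when the positive distance is strictly
    smaller than the negative distance, otherwise 0.0."""
    return [1.0 if pos < neg else 0.0 for pos, neg in y_pred]

# Hypothetical batch of (pos, neg) distances: correct, wrong, and a tie.
batch = [(0.2, 0.9), (0.7, 0.3), (0.5, 0.5)]

scores = triplet_accuracy_rows(batch)
accuracy = sum(scores) / len(scores)   # mean over the batch

print(scores, accuracy)
```

A tie counts as incorrect here, which matches the Keras version: sign(0) is 0, so the maximum with 0 stays 0.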