Loss functions in GANs - python

I'm trying to build a simple MNIST GAN and, needless to say, it didn't work. I've searched a lot and fixed most of my code, but I still can't really understand how the loss functions work.
This is what I did:
loss_d = -tf.reduce_mean(tf.log(discriminator(real_data)))  # maximize
loss_g = -tf.reduce_mean(tf.log(discriminator(generator(noise_input), trainable=False)))  # maximize, since we use d(g) instead of 1 - d(g)
loss = loss_d + loss_g
train_d = tf.train.AdamOptimizer(learning_rate).minimize(loss_d)
train_g = tf.train.AdamOptimizer(learning_rate).minimize(loss_g)
I get -0.0 as my loss value. Can you explain how to deal with loss functions in GANs?

It seems you are summing the generator and discriminator losses together, which is wrong!
Since the discriminator is trained on both real and generated data, you have to create two distinct losses: one for the real data and one for the fake (generated) data that you pass into the discriminator network.
Try to change your code as follows:
1)
loss_d_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=discriminator(real_data), labels=tf.ones_like(discriminator(real_data))))
2)
loss_d_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=discriminator(generated_samples), labels=tf.zeros_like(discriminator(generated_samples))))
where generated_samples = generator(noise_input). The discriminator loss is then loss_d = loss_d_real + loss_d_fake.
Now create the loss for your generator:
3)
loss_g = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=discriminator(generated_samples), labels=tf.ones_like(discriminator(generated_samples))))

Maryam seems to have identified the cause of your spurious loss values (i.e. summing the generator and discriminator losses). Just wanted to add that you should probably opt for the stochastic gradient descent optimizer for the discriminator in lieu of Adam; doing so provides stronger theoretical guarantees of convergence when playing the minimax game (source: https://github.com/soumith/ganhacks).
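Putting the two answers together, a minimal TF1-style sketch of the training setup might look as follows. The scope names ("discriminator", "generator"), the assumption that both network functions return logits and reuse their variables across calls, and the var_list handling are my assumptions, not code from the question or the answers.

import tensorflow as tf

# Assumed: generator(z) and discriminator(x) return *logits* and create/reuse
# their variables under the scopes "generator" and "discriminator".
generated_samples = generator(noise_input)
d_logits_real = discriminator(real_data)
d_logits_fake = discriminator(generated_samples)

# Discriminator: real samples are labelled 1, generated samples 0.
loss_d_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=d_logits_real, labels=tf.ones_like(d_logits_real)))
loss_d_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=d_logits_fake, labels=tf.zeros_like(d_logits_fake)))
loss_d = loss_d_real + loss_d_fake

# Generator: tries to make the discriminator output 1 on generated samples.
loss_g = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=d_logits_fake, labels=tf.ones_like(d_logits_fake)))

# Each optimizer updates only its own network's variables; per the ganhacks
# suggestion, plain SGD is used for the discriminator and Adam for the generator.
d_vars = tf.trainable_variables(scope="discriminator")
g_vars = tf.trainable_variables(scope="generator")
train_d = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss_d, var_list=d_vars)
train_g = tf.train.AdamOptimizer(learning_rate).minimize(loss_g, var_list=g_vars)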

Related

Tensorflow Custom Regularization Term comparing the Prediction to the True value

Hello, I am in need of a custom regularization term to add to my (binary cross-entropy) loss function. Can somebody help me with the TensorFlow syntax to implement this?
I simplified everything as much as possible so it could be easier to help me.
The model takes a dataset of 10000 18x18 binary configurations as input and outputs a 16x16 configuration for each. The neural network consists of only two convolutional layers.
My model looks like this:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

EPOCHS = 10

model = models.Sequential()
model.add(layers.Conv2D(1, 2, activation='relu', input_shape=[18, 18, 1]))   # 18x18 -> 17x17
model.add(layers.Conv2D(1, 2, activation='sigmoid'))                         # 17x17 -> 16x16
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.BinaryCrossentropy())
model.fit(initial.reshape(10000, 18, 18, 1), target.reshape(10000, 16, 16, 1),
          batch_size=1000, epochs=EPOCHS, verbose=1)
output = model(initial.reshape(10000, 18, 18, 1)).numpy().reshape(10000, 16, 16)
Now I wrote a function which I'd like to use as an additional regularization term. It takes the true values and the predictions. Basically it multiplies every point of both with its 'right' neighbor, then takes the difference. I assumed that the true and prediction tensors are 16x16 (and not 10000x16x16). Is this correct?
def regularization_term(prediction, true):
    order = list(range(1, 4))
    order.append(0)
    deviation = (true * true[:, order]) - (prediction * prediction[:, order])
    deviation = abs(deviation)**2
    return 0.2 * deviation
I would really appreciate some help with adding a function like this as a regularization term to my loss, to help the neural network train better on this 'right neighbor' interaction. I'm really struggling with TensorFlow's customization functionality.
Thank you, much appreciated.
It is quite simple. You need to specify a custom loss in which you add your regularization term. Something like this:
# to minimize!
def regularization_term(true, prediction):
    order = list(range(1, 4))
    order.append(0)
    deviation = (true * true[:, order]) - (prediction * prediction[:, order])
    deviation = abs(deviation)**2
    return 0.2 * deviation

def my_custom_loss(y_true, y_pred):
    return tf.keras.losses.BinaryCrossentropy()(y_true, y_pred) + regularization_term(y_true, y_pred)

model.compile(optimizer='Adam', loss=my_custom_loss)
As stated by Keras:
Any callable with the signature loss_fn(y_true, y_pred) that returns an array of losses (one per sample in the input batch) can be passed to compile() as a loss. Note that sample weighting is automatically supported for any such loss.
So be sure to return an array of losses (EDIT: as I can see now, it is also possible to return a simple scalar; it doesn't matter whether you use, for example, a reduce function). Basically, y_true and y_pred have the batch size as their first dimension.
More details here: https://keras.io/api/losses/
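To make that concrete, here is a hedged sketch of the same idea in which the regularization term is reduced to one value per sample before it is added to the per-sample binary cross-entropy. The use of tf.roll, the choice of the width axis for the 'right neighbor' shift, and the reduction axes are my assumptions about the intended behaviour, not code from the answer; the tensors are assumed to have shape (batch, height, width, channels).

import tensorflow as tf

def regularization_term(y_true, y_pred):
    # Pair every pixel with its right-hand neighbor (wrapping around) along the
    # width axis and compare the products for the targets and the predictions.
    true_shift = tf.roll(y_true, shift=-1, axis=2)
    pred_shift = tf.roll(y_pred, shift=-1, axis=2)
    deviation = tf.square(y_true * true_shift - y_pred * pred_shift)
    # Reduce over height, width and channels -> one value per sample.
    return 0.2 * tf.reduce_mean(deviation, axis=[1, 2, 3])

def my_custom_loss(y_true, y_pred):
    # Per-pixel BCE (Keras reduces over the channel axis), averaged per sample.
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return tf.reduce_mean(bce, axis=[1, 2]) + regularization_term(y_true, y_pred)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss=my_custom_loss)

Either way, Keras averages the returned per-sample values into the scalar loss it reports during training.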

How to implement gradient ascent in a Keras DQN

I have built a Reinforcement Learning DQN with variable-length sequences as inputs, and positive and negative rewards calculated for actions. Some problem with my DQN model in Keras means that although the model runs, average rewards over time decrease, over single and multiple cycles of epsilon. This does not change even after a significant period of training.
My thinking is that this is due to using MeanSquaredError in Keras as the loss function (minimising error). So I am trying to implement gradient ascent (to maximise reward). How can I do this in Keras? My current model is:
model = Sequential()
inp = (env.NUM_TIMEPERIODS, env.NUM_FEATURES)
model.add(Input(shape=inp))  # a shape tuple (integers), not including the batch size
model.add(Masking(mask_value=0., input_shape=inp))
model.add(LSTM(env.NUM_FEATURES, input_shape=inp, return_sequences=True))
model.add(LSTM(env.NUM_FEATURES))
model.add(Dense(env.NUM_FEATURES))
model.add(Dense(4))
model.compile(loss='mse',
              optimizer=Adam(lr=LEARNING_RATE, decay=DECAY),
              metrics=[tf.keras.losses.MeanSquaredError()])
In trying to implement gradient ascent by 'flipping' the gradient (as a negative or inverse loss?), I have tried various loss definitions:
loss=-'mse'
loss=-tf.keras.losses.MeanSquaredError()
loss=1/tf.keras.losses.MeanSquaredError()
but these all generate 'bad operand type for unary' errors.
How can I adapt the current Keras model to maximise rewards?
Or is gradient ascent not even the problem? Could it be some issue with the action policy?
Writing a custom loss function
Here is the loss function you want:
@tf.function
def positive_mse(y_true, y_pred):
    return -1 * tf.keras.losses.MSE(y_true, y_pred)
And then your compile line becomes
model.compile(loss=positive_mse,
              optimizer=Adam(lr=LEARNING_RATE, decay=DECAY),
              metrics=[tf.keras.losses.MeanSquaredError()])
Please note: use loss=positive_mse and not loss=positive_mse(). That's not a typo: you need to pass the function itself, not the result of calling it.
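To convince yourself that negating the loss really flips the update direction, a small check with tf.GradientTape (my own sketch, assuming TF 2.x eager execution, not part of the original answer) shows that the gradient of positive_mse has the same magnitude as the ordinary MSE gradient but the opposite sign:

import tensorflow as tf

def positive_mse(y_true, y_pred):
    return -1 * tf.keras.losses.MSE(y_true, y_pred)

w = tf.Variable([1.0])          # a single trainable parameter
y_true = tf.constant([[2.0]])

with tf.GradientTape(persistent=True) as tape:
    y_pred = tf.reshape(w * 3.0, (1, 1))
    loss_desc = tf.keras.losses.MSE(y_true, y_pred)   # ordinary (descent) loss
    loss_asc = positive_mse(y_true, y_pred)           # negated ("ascent") loss

print(tape.gradient(loss_desc, w).numpy())  # [ 6.]
print(tape.gradient(loss_asc, w).numpy())   # [-6.]
del tape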

Diverging loss in Keras with custom loss

I have a fully connected feed-forward network implemented with Keras. Initially, I used binary cross-entropy as the loss and the metric, and the Adam optimizer, as follows:
adam = keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['binary_crossentropy'])
This model trains well and gives good results. In order to get better results I want to use a different loss function and metric, as below:
import keras.backend as K

def soft_bit_error_loss(yTrue, yPred):
    loss = K.pow(1 - yPred, yTrue) * K.pow(yPred, 1 - yTrue)
    return K.mean(loss)

def ber(yTrue, yPred):
    x_hat_train = K.cast(K.greater(yPred, 0.5), 'uint8')
    train_errors = K.cast(K.not_equal(K.cast(yTrue, 'uint8'), x_hat_train), 'float32')
    train_ber = K.mean(train_errors)
    return train_ber
I use it to compile my model as below:
model.compile(optimizer=adam, loss=soft_bit_error_loss, metrics=[ber])
However, when I do that, the loss and the metric diverge after some training, every time, as in the following pictures.
What can be the cause of this?
Your loss function is very unstable. Written out, it is loss = (1 - x)^c * x^(1 - c), where I replaced y_pred (the variable) with x and y_true (the constant) with c for simplicity.
As your predictions approach zero, at least one of these operations will tend towards 1/0, which is infinite. Although by the theory of limits you can know the result is fine, Keras doesn't see the "whole" function as one expression; it calculates derivatives based on the basic operations used.
So, one easy solution is the one pointed out by @today:
loss = K.switch(yTrue == 1, 1 - yPred, yPred)
It's exactly the same function (difference only when c is not zero or 1).
Also, even easier, for c=0 or c=1, it's just a plain loss='mae'.
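Wrapped up as a complete Keras loss, the switch-based replacement might look like the sketch below; the K.equal comparison (instead of the bare == from the snippet above) and the function name are my additions, meant only to keep the condition element-wise.

import keras.backend as K

def stable_soft_bit_error_loss(yTrue, yPred):
    # 1 - yPred where the label is 1, yPred where the label is 0: the same
    # values as the original loss for hard 0/1 labels, but without the
    # K.pow terms whose gradients blow up as yPred approaches 0 or 1.
    per_element = K.switch(K.equal(yTrue, 1.0), 1.0 - yPred, yPred)
    return K.mean(per_element)

model.compile(optimizer=adam, loss=stable_soft_bit_error_loss, metrics=[ber])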

Numerical equivalence of PyTorch backpropagation

After I'd written a simple neural network with NumPy, I wanted to compare it numerically with a PyTorch implementation. Run on its own, my implementation seems to converge, so it appears to have no errors.
I have also checked that the forward pass matches PyTorch, so the basic setup is correct.
But something different happens during the backward pass, because the weights after one backpropagation step are different.
I don't want to post the full code here because it is spread over several .py files, and most of it is irrelevant to the question. I just want to know whether PyTorch does "basic" gradient descent or something different.
I am looking at the simplest part, the fully-connected weights of the last layer, because if these already differ, everything further back will differ too:
self.weight += self.learning_rate * hidden_layer.T.dot(output_delta)
where
output_delta = self.expected - self.output
self.expected is the expected value,
self.output is the forward pass result.
No activation or further operations here.
The PyTorch part is:
optimizer = torch.optim.SGD(nn.parameters(), lr=1.0)
criterion = torch.nn.MSELoss(reduction='sum')
output = nn.forward(x_train)
loss = criterion(output, y_train)
loss.backward()
optimizer.step()
optimizer.zero_grad()
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation function, not the basic one mentioned above? If so, I'd like to know how to numerically check my NumPy solution against PyTorch.
I just want to know whether PyTorch does "basic" gradient descent or something different.
If you use torch.optim.SGD, this means stochastic gradient descent.
There are different implementations of GD, but the one used in PyTorch is applied to mini-batches.
There are GD implementations that only update the parameters after the full epoch; as you may guess, they are very "slow", which may be fine for supercomputers to test. There are GD implementations that update on every single sample; as you may guess, their weakness is "huge" gradient fluctuations.
These are all relative terms, hence the quotes.
Note that you are using a very large learning rate (lr = 1.0), which suggests you haven't normalized your data first, but that is a skill you will pick up over time.
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation function, not the basic one mentioned above?
It uses exactly what you told it to use.
Here is an example in PyTorch and in plain Python showing that the gradient computation (as used in backpropagation) works as expected:
x = torch.tensor([5.], requires_grad=True)
print(x)       # tensor([5.], requires_grad=True)
y = 3*x**2
y.backward()
print(x.grad)  # tensor([30.])
How would you get this value of 30 in plain Python?
def y(x):
    return 3*x**2

x = 5
e = 0.01  # eta (finite-difference step)
g = (y(x+e) - y(x)) / e
print(g)  # 30.0299
As expected we get roughly 30; it would be even closer with a smaller eta.
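To numerically check your NumPy update against PyTorch, you can compare a single SGD step on one linear layer. The sketch below is my own, not from the question; note that for MSELoss(reduction='sum') the gradient carries a factor of 2 and the optimizer subtracts it, which is exactly where a hand-written rule like self.weight += learning_rate * ... can silently differ.

import numpy as np
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)   # batch of 4 samples, 3 features
y = torch.randn(4, 2)   # targets

layer = torch.nn.Linear(3, 2, bias=False)
w0 = layer.weight.detach().numpy().copy()   # weights before the step

opt = torch.optim.SGD(layer.parameters(), lr=1.0)
loss = torch.nn.MSELoss(reduction='sum')(layer(x), y)
loss.backward()
opt.step()

# By hand: for sum-reduced MSE, dL/dW = 2 * (output - target)^T @ x,
# and SGD does W <- W - lr * dL/dW.
out = x.numpy() @ w0.T
grad = 2.0 * (out - y.numpy()).T @ x.numpy()
w_manual = w0 - 1.0 * grad

print(np.allclose(layer.weight.detach().numpy(), w_manual, atol=1e-5))  # True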

Softmax Cross Entropy loss explodes

I am creating a deep convolutional neural network for pixel-wise classification. I am using the Adam optimizer and softmax with cross-entropy.
Github Repository
I asked a similar question found here, but the answer I was given did not solve the problem. I also have a more detailed graph of what is going wrong. Whenever I use softmax, the problem in the graph occurs. I have tried many things, such as adjusting the learning rate and epsilon, trying different optimizers, etc. The loss never decreases past 500. I do not shuffle my data at the moment. Using sigmoid in place of softmax results in this problem not occurring; however, my problem has multiple classes, so the accuracy of sigmoid is not very good. It should also be mentioned that even when the loss is low, my accuracy is only about 80%, and I need much better than that. Why would my loss suddenly spike like this?
x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])
#Many Convolutions and Relus omitted
final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)
W_final = weight_variable([7168,7168,3])
b_final = bias_variable([7168,3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
You need label smoothing.
I just had the same problem. I was training with tf.nn.sparse_softmax_cross_entropy_with_logits, which is the same as using tf.nn.softmax_cross_entropy_with_logits with one-hot labels. My dataset predicts the occurrence of rare events, so the labels in the training set are 99% class 0 and 1% class 1. My loss would start to fall, then stagnate (but with reasonable predictions), then suddenly explode, after which the predictions also went bad.
Using tf.summary ops to log the internal network state to TensorBoard, I observed that the logits were growing and growing in absolute value. Eventually, at values above 1e8, tf.nn.softmax_cross_entropy_with_logits became numerically unstable, and that is what generated those weird loss spikes.
In my opinion, the reason this happens lies in the softmax function itself, which is in line with Jai's comment that putting a sigmoid in before the softmax will fix things. That will quite surely also make it impossible for the softmax likelihoods to be accurate, as it limits the value range of the logits, but in doing so it prevents the overflow.
Softmax is defined as likelihood[i] = tf.exp(logit[i]) / tf.reduce_sum(tf.exp(logit)), and cross-entropy is defined as tf.reduce_sum(-label_likelihood[i] * tf.log(likelihood[i])), so if your labels are one-hot, that reduces to just the negative logarithm of your target likelihood. In practice, that means you are pushing likelihood[true_class] as close to 1.0 as you can, and due to the softmax normalization the only way to do that is to make tf.exp(logit[i]) for every i != true_class as close to 0.0 as possible.
So in effect, you have asked the optimizer to produce tf.exp(x) == 0.0, and the only way to do that is to drive x towards -infinity. That is where the numerical instability comes from.
The solution is to "blur" the labels, so instead of [0,0,1] you use [0.01,0.01,0.98]. Now the optimizer only works to reach tf.exp(x) == 0.01, which gives x == -4.6, safely inside the numerical range where GPU calculations are accurate and reliable.
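In the TF1 graph code from the question, the blurring can be applied directly to the one-hot labels before they reach the cross-entropy op. A minimal sketch, where the smoothing factor eps = 0.03 is an arbitrary choice that reproduces the [0.01, 0.01, 0.98] example above:

# Blur the one-hot labels: [0, 0, 1] becomes [0.01, 0.01, 0.98] for eps = 0.03.
eps = 0.03
num_classes = 3
y_smoothed = y_ * (1.0 - eps) + eps / num_classes

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_smoothed, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)

Alternatively, tf.losses.softmax_cross_entropy exposes a label_smoothing argument that does the same blurring for you.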
Not sure what causes it exactly; I had the same issue a few times. A few things generally help: you might reduce the learning rate, i.e. the bound of the learning rate for Adam (e.g. from 1e-5 to 1e-7 or so), or try stochastic gradient descent. Adam tries to estimate learning rates itself, which can lead to unstable training: see Adam optimizer goes haywire after 200k batches, training loss grows.
Once I also removed batchnorm and that actually helped, but that was for a "specially" designed network for stroke data (= point sequences), which was not very deep and used Conv1d layers.
