Practical Uncertainties in "Reverse" NLLLoss for Generative Adversarial Network - python

I've been attempting to implement my own GAN with some limited success. I tweaked the training procedure from the tutorial a bit and was wondering whether this more drastic change is viable. The first change was to make the discriminator classify n+1 classes, where n could be, for example, 10 for MNIST, and the (n+1)th class is the "fake" class. The discriminator is an architecture I previously built from scratch as a strong classifier and imported here. The generator then uses the "opposite" of the traditional NLLLoss.
Here is the tricky, non-traditional part. Because the last layer of my discriminator is a softmax, its outputs always lie between 0 and 1. So I can write a custom generator loss that is a horizontal flip of the NLLLoss, one that pushes the discriminator away from classifying the fakes as the (n+1)th class. The idea is that I don't care which class the fakes are assigned to, as long as it is not the (n+1)th class. This misclassification is what I want the generator to maximize.
Here is the function I plotted on Desmos to give some visualization:
https://www.desmos.com/calculator/6gdqs28ihk
My actual code for the generator loss function is below, while the discriminator loss function is the traditional NLLLoss:
# outputG: softmax output of the discriminator; `classes` is the index of the fake (n+1)th class
loss_G = torch.mean(-torch.log(1 - outputG.float()[:, classes]))
Please let me know if this is completely wrong or there is an easier way.
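
For reference, a minimal, numerically stable sketch of this generator loss (this assumes outputG holds the softmax probabilities with shape (batch, n+1) and that classes is the index of the fake class; the clamp is only there to guard against log(0)):

import torch

def generator_loss(outputG, fake_class_idx, eps=1e-7):
    # probability the discriminator assigns to the fake (n+1)th class
    p_fake = outputG.float()[:, fake_class_idx]
    # push p_fake toward 0; clamp so log never sees exactly 0
    return torch.mean(-torch.log((1.0 - p_fake).clamp(min=eps)))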

How to fine-tune a CGAN?

I am currently building a Conditional GAN to apply data augmentation on a small audio dataset.
My problem is that I don't really know how to calibrate my models and their parameters. I feel there is a need to fine-tune the hyperparameters in a certain way, but I don't know in which direction to go.
First of all, here is a plot of my losses over the epochs. Please don't mind the axis labels; they are wrong because I reused a plotting function without modifying them:
[plot of the losses per epoch]
As we can see, the two losses cross each other. I believe they should stay balanced and approximately equal for the rest of the training, but in my case they diverge and never meet again. I was wondering whether this is normal behavior; maybe I should stop the training when they cross?
Please tell me if you have any leads, clues, or criticism that would allow me to improve my models.
For further information, here are some of the hyper-parameters I am using:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# I used custom loss functions for both models; each function uses this cross_entropy,
# but I am quite confident that this part is correct.
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=False)
# different learning rates because I felt that the discriminator model was too chaotic
generator_optimizer = Adam(8e-5)
discriminator_optimizer = Adam(2e-5)
BATCH_SIZE = 20
epochs = 1000
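
For context, custom losses built on this cross_entropy typically look something like the following (this is only an assumed shape, since the actual functions are not shown in the question):

def discriminator_loss(real_output, fake_output):
    # real samples should be classified as 1, generated samples as 0
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    # the generator wants the discriminator to label its fakes as real (1)
    return cross_entropy(tf.ones_like(fake_output), fake_output)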
I am conscious that 1000 epochs is way too many for this, but I wanted to observe the behavior on a large scale.
I built my generator like this:
[generator model architecture diagram]
And my discriminator model like this:
[discriminator model architecture diagram]
The architectures are built using the functional API of TensorFlow.
Thanks for reading and please tell me if you see anything funny or if you have any leads.

Is it incorrect to change a model's parameters after training it?

I was trying to use average ensembling on a group of models I trained earlier (I create a new model in the ensemble for each pre-trained model and then load the trained weights onto it; I know this is inefficient, but I'm just learning about it, so it doesn't really matter). I mistakenly changed some of the network's parameters when loading the models in the ensemble code, such as using ReLU instead of the LeakyReLU I used during training, and a different value for an L2 regularizer in the dense layer of one of the models. This, however, gave me a better testing accuracy for the ensemble. Can you please explain to me if/how this is incorrect, and if it is normal, can I use this method to further enhance the accuracy of the ensemble?
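
For context, average ensembling along these lines might look like the following Keras sketch (the architecture, activation choices, and weight-file names here are hypothetical stand-ins, not the asker's actual models):

import numpy as np
import tensorflow as tf

def build_member(activation, l2=1e-4):
    # simplified stand-in for the real architectures; the point is that the
    # activation and regularizer must match what was used during training
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation=activation,
                              kernel_regularizer=tf.keras.regularizers.l2(l2),
                              input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

members = []
for path in ["member_1.h5", "member_2.h5"]:          # hypothetical weight files
    m = build_member(activation=tf.nn.leaky_relu)    # same activation as training
    m.load_weights(path)                              # must match the saved architecture
    members.append(m)

def ensemble_predict(x):
    # average ensembling: mean of the per-class probabilities across members
    preds = np.stack([m.predict(x) for m in members], axis=0)
    return preds.mean(axis=0)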
I believe it is NOT correct to change a model's parameters after training it. By parameters here I mean the trainable parameters, like the weights in a Dense node, not hyper-parameters like the learning rate.
What is training?
Training is essentially a loop that keeps changing, or updating, the parameters. It updates the parameters in a way that it believes will reduce the loss. It is like moving a point around a hyper-space toward a location where the loss function gives a small loss.
Smaller loss means higher accuracy in general.
Changing Weights
So now, changing your parameter values, by mistake or on purpose, is like moving that point somewhere else, BUT you have no logical reason to believe that such a move will give you a smaller loss. You are just randomly wandering around that hyper-space, and in your case you were simply lucky to land on a point that happened to give a smaller loss or a better testing accuracy. It is purely luck.
Changing activation function
Also, altering the activation function from LeakyReLU to ReLU is similar: you randomly alter the shape of your hyper-space. Even though you are at the same point, the landscape changes, and you still have no logical reason to believe that this change of landscape will give you a smaller loss at that point.
When you change the model manually, you need to retrain.
Though you changed the network's parameters when loading the models, it is not incorrect to alter the hyper-parameters of your ensemble's underlying models. In some cases, the models used in an ensemble method require unique tunings, which can, as you mentioned, give "a better testing accuracy for the ensemble model."
To answer your second question: yes, you can use this method to further enhance the accuracy of the ensemble. You can also use Bayesian optimization, GridSearch, and RandomSearch if you prefer more automated means of tuning your hyperparameters.
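
As a generic illustration of the automated options mentioned (not tied to the Keras models above; the estimator and search grid here are placeholders), a random search with scikit-learn can look like this:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# toy data standing in for whatever task the ensemble members solve
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [100, 200, 400],
                         "max_depth": [None, 5, 10]},
    n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # best hyper-parameters found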

What kinds of tricks can we play with to further refine a trained neural network model so that it has a lower objective function value?

I ask this question because many deep learning frameworks, such as Caffe, support a model-refining function. For example, in Caffe, we can use a snapshot to initialize the neural network parameters and then continue training, as the following command shows:
./caffe train -solver solver_file.prototxt -snapshot snap_file.solverstate
In order to further train the model, these are the tricks I can play with:
use a smaller learning rate
change the optimisation method, for example from stochastic gradient descent to the Adam algorithm
Any other tricks I can play with?
PS: I understand that reducing the loss function value on the training samples does not mean that we get a better model.
The question is way too broad, I think. However, this is a common practice, especially in the case of a small training set. I would rank possible methods like this:
smaller learning rate
more/different data augmentation
add noise to the training set (related to data augmentation, indeed)
fine-tune on a subset of the training set.
The very last one is indeed a very powerful method for finalizing a model that performs poorly on some corner cases. You can then make a 'difficult' training subset in order to bias the model towards it. I personally use it very often.
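
As a hedged sketch of two of these tricks in Keras terms (the question uses Caffe, so this only illustrates the idea; checkpoint.h5 and the hard_x/hard_y subset are hypothetical):

import tensorflow as tf

# resume from a saved checkpoint, then continue with a smaller learning rate
# and a different optimizer (e.g., switching to Adam)
model = tf.keras.models.load_model("checkpoint.h5")  # hypothetical snapshot file
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])

# fine-tune on a 'difficult' subset to bias the model toward corner cases
# hard_x, hard_y = ...  (examples the current model gets wrong)
# model.fit(hard_x, hard_y, epochs=5, batch_size=32)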

neural nets - How can I associate a confidence with my loss function?

I am trying to do OCC (one-class classification) using an autoencoder-based neural network.
To make a long story short, I train my neural network with 200 matrices, each containing 128 data elements. Those are then compressed (see autoencoder).
Once the training is done, I pass a new matrix (test data) to my neural net and, based on the loss value, I know whether the data I passed belongs to the target class or not.
I would like to know how I can compute a classification confidence in % based on the loss value I obtain when passing test data.
Thanks
In case it helps, I am using TensorFlow.
Well, actually you normally try to minimize your cost function (or, in the case of one training observation, your loss function). Normally the probability of the class you want to predict is not obtained from the loss function, but from a sigmoid output layer, for example. You need a function that goes from 0 to 1 and behaves like a probability. Where did you get the idea of using the loss function to evaluate your probability? But I am not an expert in one-class classification (or outlier detection)... I guess you actually want the probability of your observation not belonging to your class, right?
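
A minimal sketch of that point (the layer sizes and input shape are only placeholders): a sigmoid output unit is bounded between 0 and 1 and can be read directly as a confidence, unlike a raw loss value, which is unbounded.

import tensorflow as tf

# a sigmoid output unit produces a value in (0, 1) that behaves like
# P(target class | x); a reconstruction loss, by contrast, is unbounded
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.predict(x) now returns values in (0, 1) usable as confidences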

Convolutional Neural Network accuracy with Lasagne (regression vs classification)

I have been playing with Lasagne for a while now for a binary classification problem using a Convolutional Neural Network. However, although I get okay(ish) results for training and validation loss, my validation and test accuracy is always constant (the network always predicts the same class).
I have come across this post by someone who has had the same problem as me with Lasagne. Their solution was to set regression=True, as they are using nolearn on top of Lasagne.
Does anyone know how to set this same variable within Lasagne (as I do not want to use Nolearn)? Further to this, does anyone have an explanation as to why this needs to happen?
Looking at the code of the NeuralNet class from nolearn, it looks like the parameter regression is used in various places, but most of the time it affects how the output value and loss are computed.
In case of regression=False (the default), the network outputs the class with the maximum probability, and computes the loss with the categorical crossentropy.
On the other hand, in case of regression=True, the network outputs the probabilities of each class, and computes the loss with the squared error on the output vector.
I am not an expert in deep learning and CNNs, but the reason this may have worked is that, in the case of regression=False and a small error gradient, applying small changes to the network parameters may not change the predicted class or the associated loss, and may lead the algorithm to "think" it has converged. But if instead you look at the class probabilities, small parameter changes will affect the probabilities and the resulting mean squared error, and the network will continue down this path, which may eventually change the predictions.
This is just a guess; it is hard to tell without seeing the code and the dataset.
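
A tiny numeric illustration of that argument (made-up probabilities, not from the question): a small change in the class probabilities can leave the argmax, and hence the predicted class, unchanged, while the squared error against the one-hot target still moves.

import numpy as np

target   = np.array([0.0, 1.0])     # true class is 1
p_before = np.array([0.60, 0.40])   # predicted probabilities
p_after  = np.array([0.58, 0.42])   # after a small parameter update

print(p_before.argmax(), p_after.argmax())   # 0 0 -> same predicted class
print(((p_before - target) ** 2).mean())     # 0.36
print(((p_after - target) ** 2).mean())      # 0.3364 -> the loss still changed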
