I am currently building a Conditional GAN to apply data augmentation on a small audio dataset.
My problem is that I don't really know how to calibrate my models and their parameters. I feel that the hyperparameters need fine-tuning, but I don't know in which direction to go.
First of all, here is a plot of my losses over the epochs. Please ignore the axis names; they are wrong because I reused a plotting function without changing the labels:
[Figure: plot of the losses per epoch]
As we can see, the two losses cross each other and I believe they should stay balanced and approximately equal for the rest of the training, but in my case, they diverge and never meet again. I was wondering if this is normal behavior, maybe I should stop the training when they cross?
Please tell me if you have any leads, clues, or criticism that would allow me to improve my models.
For further information, here are some of the hyper-parameters I am using:
# Imports assumed; the original post did not show them.
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# I used custom loss functions for both models; each function uses this cross_entropy,
# and I am quite confident that this part is correct.
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=False)

# Different learning rates because I felt that the discriminator model was too chaotic.
generator_optimizer = Adam(8e-5)
discriminator_optimizer = Adam(2e-5)

BATCH_SIZE = 20
epochs = 1000
I am aware that 1000 epochs is far too many for this, but I wanted to observe the behavior over a long run.
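For reference, the custom loss functions themselves are not shown in the post. The sketch below assumes the common non-saturating GAN formulation and re-implements binary cross-entropy on probabilities in plain Python; the names `discriminator_loss` and `generator_loss` are hypothetical and not taken from the original code. It only illustrates what the two losses evaluate to when the discriminator cannot tell real from generated samples.

```python
import math

EPS = 1e-7  # clip probabilities away from 0 and 1 to keep log() finite


def binary_cross_entropy(targets, probs):
    """Mean binary cross-entropy on probabilities (the from_logits=False case)."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


def discriminator_loss(real_probs, fake_probs):
    # Real samples should be scored 1, generated samples 0.
    real_loss = binary_cross_entropy([1.0] * len(real_probs), real_probs)
    fake_loss = binary_cross_entropy([0.0] * len(fake_probs), fake_probs)
    return real_loss + fake_loss


def generator_loss(fake_probs):
    # The generator wants its samples to be scored as real (1).
    return binary_cross_entropy([1.0] * len(fake_probs), fake_probs)


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = [0.5] * 20
d_loss = discriminator_loss(undecided, undecided)
g_loss = generator_loss(undecided)
print(d_loss, g_loss)
```

Under this formulation the balance point is a discriminator loss of 2 ln 2 (about 1.386) and a generator loss of ln 2 (about 0.693), so the two curves would not be expected to sit at the same value even when training is balanced.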
I built my generator like this:
[Figure: generator model]
And my discriminator model looks like this:
[Figure: discriminator model]
The architecture is built with the functional API of TensorFlow.
Thanks for reading and please tell me if you see anything funny or if you have any leads.
I am training and evaluating binary classifiers on image data using transfer learning with the Keras API. I'd like to compare the performance of several models (ResNet, Inception, Xception, VGG, EfficientNet). The dataset is split into train (approx. 2000 images), valid (approx. 250 images), and test (approx. 250 images).
I have run into a situation that is unfamiliar to me, so I have a couple of questions.
As shown below, the validation accuracy and loss fluctuate heavily from epoch to epoch.
I wonder what is causing this and what needs to be changed.
[Figures: epoch_acc_loss, loss_epoch, acc_epoch]
If I want to report validation accuracy as a single number, what should I report in the case above:
the average, the maximum, or the minimum?
I am using Keras (TensorFlow), and the API has many examples for training and validation, but code for testing (evaluation?) is hard to find. When reporting performance, is it normal to stop at validation, or do I need to show an evaluation result on the test set?
I currently use the Keras API for transfer learning with these settings:
include_top=False
conv_base.trainable=False
Summary
I wonder whether transfer learning still has an effect without including the top layers and, if not, whether there is a way to freeze or train only specific layers of conv_base.
I'm a beginner without much experience, so these may be silly questions, but I would appreciate any advice.
Thanks a lot in advance.
It's hard to figure out the problem without any code or model structure. From your loss graph I can see that your model is underfitting (or it uses a lot of dropout). Common mistakes that make models underfit are a very high learning rate and an overly simple architecture (so the model can't capture the dependencies in your data). And never forget the principle "garbage in, garbage out": double-check your data for any irregularities.
Well, the validation accuracy in your training logs is the mean accuracy over the validation set. Validation is statistical in nature: you hold out a random N% of your data for it, so when we're talking about multiple experiments (or cross-validation), the average is the better number to report.
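A minimal sketch of reporting the average over several runs, assuming you have collected one final validation accuracy per run. The numbers below are made up for illustration, and reporting the spread next to the mean is my own suggestion rather than something the logs give you directly.

```python
import statistics

# Hypothetical final validation accuracies from five runs (made-up numbers).
val_acc_runs = [0.81, 0.84, 0.79, 0.83, 0.82]

mean_acc = statistics.mean(val_acc_runs)
std_acc = statistics.stdev(val_acc_runs)  # sample standard deviation

summary = f"{mean_acc:.3f} +/- {std_acc:.3f} (n={len(val_acc_runs)})"
print(summary)
```

Quoting the standard deviation alongside the mean also gives a feel for how much of the epoch-to-epoch fluctuation is just noise.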
I'm not sure I've understood your question correctly here, but if you want to evaluate your model after training (the fit() call) with the metric you specified for it, you should use model.evaluate(val_x, val_y). Or you can use model.predict(val_x) and compare its results to val_y with your own metric function.
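The second option (predict, then score with your own metric) can be sketched as below. This is plain Python with made-up numbers standing in for the labels and for the output of model.predict; `binary_accuracy` is a hypothetical helper written for this example, not the Keras metric of the same name, and the 0.5 threshold is an assumption.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    correct = sum(
        int((p >= threshold) == bool(t)) for t, p in zip(y_true, y_prob)
    )
    return correct / len(y_true)


# Made-up labels and predicted probabilities standing in for test_y and
# the output of model.predict(test_x).
test_y = [1, 0, 1, 1, 0, 0]
test_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

test_acc = binary_accuracy(test_y, test_prob)
print(test_acc)
```

The same call pattern works for the held-out test split: pass the test inputs and labels instead of the validation ones.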
If you are using the default weights for Keras pretrained models (ImageNet weights) and you want to attach your own fully connected part, you use ONLY the pretrained feature extractor (the conv blocks), which is what include_top=False specifies. Of course there will be a positive effect (I'd say a significant one compared with randomly initialized weights), because the conv blocks have parameters that were trained to extract useful features from images. I would also recommend the so-called "fine-tuning" technique: freeze all layers in the pretrained part except a few at its end (maybe a few layers, or even 2-3 conv blocks). Here is an example of fine-tuning EfficientNetB0:
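from tensorflow.keras.applications import EfficientNetB0

effnet = EfficientNetB0(weights="imagenet", include_top=False, input_shape=(540, 960, 3))
effnet.trainable = True
for layer in effnet.layers:
    # Freeze everything except the last conv block ('block7a') and the 'top' layers.
    if 'block7a' not in layer.name and 'top' not in layer.name:
        layer.trainable = False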
Here I freeze all pretrained weights except those of the last conv block. I looked at the model with effnet.summary() and picked the names of the blocks that I want to unfreeze.
I was trying to use average ensembling on a group of models I trained earlier. (For each pretrained model I create a new model in the ensemble and load the trained weights into it. I know this is inefficient, but I'm just learning, so it doesn't really matter.) When loading the models in the ensemble code, I mistakenly changed some of the networks' settings, such as using ReLU instead of the LeakyReLU I used in training, and a different value for an L2 regularizer in the dense layer of one of the models. This, however, gave me a better testing accuracy for the ensemble. Can you please explain whether and how this is incorrect, and, if it's normal, whether I can use this method to further improve the accuracy of the ensemble?
I believe it is NOT correct to change a model's parameters after training it. By parameters I mean the trainable parameters, like the weights of a Dense node, not hyperparameters like the learning rate.
What is training?
Training is essentially a loop that keeps updating the parameters. It updates them in the direction it believes will reduce the loss. You can also picture it as moving a point through a high-dimensional space toward a location where the loss function gives a small value.
A smaller loss generally means higher accuracy.
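To make the "loop that keeps updating the parameters" concrete, here is a minimal sketch of plain gradient descent on a one-parameter toy loss. The loss function, learning rate, and step count are all arbitrary choices for illustration; a real network does the same thing with many parameters and a loss computed from data.

```python
def loss(w):
    # Toy loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the toy loss with respect to w.
    return 2.0 * (w - 3.0)


w = 0.0               # initial parameter value
learning_rate = 0.1
history = [loss(w)]

for step in range(50):
    w -= learning_rate * grad(w)   # move against the gradient
    history.append(loss(w))

print(w, history[0], history[-1])
```

Every step here has a reason behind it (the gradient), which is exactly what a manual or accidental change to the weights lacks.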
Changing Weights
So changing your parameter values, by mistake or on purpose, is like moving that point somewhere else, BUT you have no logical reason to expect that the move will give you a smaller loss. You are just wandering randomly around that space, and in your case you happened to land on a point that gives a smaller loss or a better testing accuracy. It is purely luck.
Changing activation function
Likewise, changing the activation function from LeakyReLU to ReLU randomly alters the shape of that space. Even though you stay at the same point, the landscape changes, and you still have no logical reason to believe that the new landscape gives a smaller loss at that point.
When you change the model manually, you need to retrain.
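A small sketch of why swapping the activation matters even when the weights are untouched. The tiny network, its weights, and the LeakyReLU slope of 0.3 are all made up for illustration; the point is only that the same weights produce a different output once the activation changes.

```python
def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.3):
    # alpha is the slope for negative inputs; 0.3 is an arbitrary choice here.
    return x if x > 0 else alpha * x


def tiny_net(x, activation):
    """One hidden layer with two units and fixed, made-up weights."""
    hidden = [activation(1.0 * x + 0.5), activation(-2.0 * x + 1.0)]
    return 0.7 * hidden[0] - 1.2 * hidden[1]


x = 2.0
out_leaky = tiny_net(x, leaky_relu)  # activation the weights were "trained" with
out_relu = tiny_net(x, relu)         # same weights, swapped activation
print(out_leaky, out_relu)
```

The second hidden unit has a negative pre-activation, so LeakyReLU passes a scaled value through while ReLU zeroes it, and the downstream weights were never fitted to that behavior.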
Although you changed the networks' settings when loading the models, it is not incorrect to alter the hyperparameters of your ensemble's underlying models. In some cases the models used in an ensemble need their own tuning, which can, as you mentioned, give "you a better testing accuracy for the ensemble model."
To answer your second question: yes, you can use this method to further improve the accuracy of the ensemble. You can also use Bayesian optimization, GridSearch, or RandomSearch if you prefer more automated ways of tuning your hyperparameters.
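For context, the average ensembling described in the question can be sketched as follows. This is plain Python with made-up probabilities from three hypothetical models; `average_ensemble` is an illustrative helper, and the 0.5 decision threshold is an assumption.

```python
def average_ensemble(prob_lists):
    """Average per-sample probabilities across models (one list per model)."""
    n_models = len(prob_lists)
    n_samples = len(prob_lists[0])
    if any(len(p) != n_samples for p in prob_lists):
        raise ValueError("every model must predict the same samples")
    return [sum(p[i] for p in prob_lists) / n_models for i in range(n_samples)]


# Made-up probabilities from three hypothetical models on four samples.
model_probs = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.3, 0.1, 0.4],
    [0.7, 0.5, 0.3, 0.8],
]
ensemble_probs = average_ensemble(model_probs)
ensemble_labels = [int(p >= 0.5) for p in ensemble_probs]
print(ensemble_probs, ensemble_labels)
```

Because the ensemble only sees each member's output probabilities, anything that changes a member's outputs (weights, activation, and so on) changes the average as well.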
I am using an LSTM for time-series prediction in Keras. I am using 3 LSTM layers with dropout=0.3, hence my training loss is higher than my validation loss. To monitor convergence, I plot the training loss and the validation loss together. The results look like the following.
After researching the topic, I have seen multiple answers, for example [1][2], but I have found several contradictory arguments in different places on the internet, which leaves me a little confused. I am listing some of them below:
1) Articles by Jason Brownlee suggest that the validation and training curves should meet at convergence, and that if they don't, I might be underfitting the data.
https://machinelearningmastery.com/diagnose-overfitting-underfitting-lstm-models/
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
2) However, the following answer here suggests that my model has simply converged:
How do we analyse a loss vs epochs graph?
Hence, I am a bit confused about the whole concept in general. Any help will be appreciated.
Convergence implies you have something to converge to. For a learning system to converge, you would need to know the right model beforehand. Then you would train your model until it was the same as the right model. At that point you could say the model converged! ... but the whole point of machine learning is that we don't know the right model to begin with.
So when do you stop training? In practice, you stop when the model works well enough to do what you want it to do. This might be when validation error drops below a certain threshold. It might just be when you can't afford any more computing power. It's really up to you.
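One way to turn "stop when it is good enough" into code is sketched below. The threshold rule comes from the paragraph above; the patience rule (stop when the best validation loss is several epochs old) is an extra assumption of mine, and `should_stop` is a hypothetical helper, not a Keras API. The loss values are made up.

```python
def should_stop(val_losses, threshold=None, patience=5):
    """Return True once validation loss is good enough or has stopped improving.

    threshold: stop as soon as the latest loss is at or below this value.
    patience:  stop if the best loss is at least `patience` epochs old.
    """
    if not val_losses:
        return False
    if threshold is not None and val_losses[-1] <= threshold:
        return True
    best_epoch = val_losses.index(min(val_losses))
    return (len(val_losses) - 1 - best_epoch) >= patience


# Made-up validation losses: improving, then flat.
losses = [0.90, 0.70, 0.55, 0.50, 0.51, 0.52, 0.50, 0.53, 0.51]
stop_epoch = next(
    (e for e in range(1, len(losses) + 1) if should_stop(losses[:e], patience=5)),
    None,
)
print(stop_epoch)
```

Which rule and which values to use is a judgment call for your problem, which is really the point of the answer.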
I used fully connected layers via the Keras API to predict regression values.
To check how well the model generalizes, I plotted the training and validation loss. I expected the plot to show a point where the validation loss becomes higher than the training loss.
However, both loss values were almost identical and barely changed.
So I was wondering:
Is the model trained well?
If the model is trained well, how should I interpret the loss plot?
The model's performance isn't good, so what should I do to improve it?
As we all learned, the validation error is often higher than the training loss, so it may feel strange when the two are the same. In my opinion, your model probably hasn't been trained well. You will have to try more parameter settings and record them (together with the corresponding error on the test data set) in a file to help you find the best ones.
One simple way of finding better parameters is a for-loop over one parameter with the others held fixed, recording each run like below:
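import os

# Placeholder values; replace with the hyperparameters of the current run.
parameter_1, parameter_2, parameter_3 = 0.001, 32, 0.3
values = [parameter_1, parameter_2, parameter_3]
header = 'parameter_1,parameter_2,parameter_3\n'

log_path = 'training-log.csv'
write_header = not os.path.exists(log_path)  # write the header only once
with open(log_path, 'a') as f:
    if write_header:
        f.write(header)
    # One comma-separated row per run.
    f.write(','.join(str(v) for v in values) + '\n')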
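The loop over one parameter with the others fixed could look roughly like the sketch below. `train_and_evaluate` is a hypothetical stand-in that returns a made-up score; in practice it would train the model and return its error on held-out data. The parameter names and values are also just examples.

```python
import csv
import os
import tempfile


def train_and_evaluate(learning_rate, batch_size):
    """Hypothetical stand-in: a real version would train the model and
    return its error on held-out data. Here it returns a made-up score."""
    return abs(learning_rate - 0.01) * 100 + batch_size / 1000


fixed = {"batch_size": 32}                  # parameters held constant
learning_rates = [0.1, 0.01, 0.001]         # the one parameter being varied

log_path = os.path.join(tempfile.gettempdir(), "training-log.csv")
with open(log_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["learning_rate", "batch_size", "error"])
    for lr in learning_rates:
        error = train_and_evaluate(lr, **fixed)
        writer.writerow([lr, fixed["batch_size"], error])

# Read the log back and pick the row with the lowest error.
with open(log_path, newline="") as f:
    rows = list(csv.DictReader(f))
best = min(rows, key=lambda r: float(r["error"]))
print(best)
```

Once the best value for one parameter is found, it can be fixed and the sweep repeated for the next parameter.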
Hope this can help you.
I have a dataset of roughly 34000 images divided into 2 sets: train (30000 images) and validation (4000 images). Each image is the difference between two frames taken from a video (the time offset between the frames in each pair is about 1 second). The videos have a static background, so the diff images are mostly black, with only one or two small colored regions. Each diff image has a label (an action happened or not: 1 or 0), so this is a binary classification problem.
Briefly, I'm using the slim models pretrained on ImageNet and fine-tuning them on my dataset. I launched 5 separate training runs with 5 different networks: InceptionV4, InceptionResnetV2, Resnet152, NASNet-mobile, and NASNet. I got very good results with the first 4 networks (InceptionV4, InceptionResnetV2, Resnet152, NASNet-mobile), but not with NASNet.
With NASNet, the area under the ROC curve on the validation set is always 0.5 and the logits of the validation images all have roughly the same values, which is really weird. In fact, I saw the same kind of results with NASNet-mobile during the first 10000 mini-batches, but after that the model did converge. Here are the values of the hyperparameters in my script:
batch_size=10
weight_decay = 0.00004
optimizer = rmsprop
rmsprop_momentum = 0.9
rmsprop_decay = 0.9
learning_rate_decay_type = exponential
learning_rate = 0.01
learning_rate_decay_factor = 0.94
num_epochs_per_decay = 2.0 #'Number of epochs after which learning rate
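To see what these decay settings imply, here is a small sketch of the resulting schedule. It assumes the decay interval is derived from the 30000 training images and the batch size of 10, and that the decay is applied in discrete jumps (staircase); both are assumptions about how the training script interprets the flags, and `decayed_learning_rate` is a hypothetical helper.

```python
def decayed_learning_rate(step, base_lr=0.01, decay_factor=0.94,
                          num_epochs_per_decay=2.0, batch_size=10,
                          num_train_samples=30000, staircase=True):
    """Exponential decay: base_lr * decay_factor ** (step / decay_steps)."""
    decay_steps = int(num_train_samples * num_epochs_per_decay / batch_size)
    exponent = step / decay_steps
    if staircase:
        exponent = step // decay_steps  # decay in discrete jumps
    return base_lr * decay_factor ** exponent


decay_steps = int(30000 * 2.0 / 10)          # 6000 mini-batches per decay
lr_at_10k = decayed_learning_rate(10_000)    # one decay applied so far
lr_at_60k = decayed_learning_rate(60_000)    # ten decays applied
print(decay_steps, lr_at_10k, lr_at_60k)
```

Under these assumptions the rate is still 0.0094 after 10000 mini-batches, barely below the initial 0.01, so the schedule on its own changes little during the period where NASNet-mobile was stuck.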
I'm still a newbie in TensorFlow and did not find anything related elsewhere. This is really weird behavior, because I'm using the same parameters and the same inputs, yet with NASNet there seems to be a problem somewhere. I'm not only looking for a solution (if one is possible; I know it is tough to troubleshoot such things without more details about the model), but insights about where to look and how to troubleshoot would also be great. Has anybody had this problem when fine-tuning NASNet before, for example the model not converging? Finally, I know it is really hard to get answers to such questions, but I hope to get at least some insights so I can move forward with my investigation.
EDIT:
Here are the plots of the cross entropy and regularization losses:
EDIT:
As proposed in the answer, I did set the drop_path_keep_prob params to 1 and now the model converged and I got good accuracy on the validation set. But now the question is: what does this param mean? Is it one of the params that we should adapt to our dataset (like learning rate etc..)?
The simplest sanity check you can do would be to run the finetuning on a single minibatch. Any deep network should be able to overfit to that, if there aren't any big problems. If you see that it can't do that, then there must be some problem with the definition, or the way you're using the definition.
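The idea of the single-minibatch check, sketched on a toy model rather than NASNet: train repeatedly on one fixed batch and confirm the loss goes close to zero. The logistic model, the four made-up samples, the learning rate, and the step count are all illustrative assumptions; with a real network you would feed the same mini-batch at every step of your usual training loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def batch_loss(w, b, xs, ys):
    """Mean binary cross-entropy of a logistic model on one mini-batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


# One tiny, made-up mini-batch that is easy to separate.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
initial_loss = batch_loss(w, b, xs, ys)
for _ in range(2000):
    # Gradients of the mean cross-entropy for a logistic model.
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b
final_loss = batch_loss(w, b, xs, ys)
print(initial_loss, final_loss)
```

If the loss on the repeated batch refuses to drop like this, the problem is more likely in the model definition or the input pipeline than in the hyperparameters.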
The only guess I have in your case is that it could be something to do with the drop_path implementation. It's disabled in the mobile version, but it is enabled during training on the large model. It could make the model unstable enough that it wouldn't fine tune, so it may be worth trying to train with it disabled.
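As far as I understand it, drop path is a regularizer that, during training, randomly zeroes whole paths inside the network's cells and rescales the ones that are kept, so a keep probability of 1 turns it off. The sketch below is a simplified plain-Python illustration of that idea, not the slim implementation; `drop_path` and its arguments are hypothetical names.

```python
import random


def drop_path(batch, keep_prob, rng):
    """Randomly zero whole samples' activations on a path, rescaling survivors.

    batch: list of per-sample activation lists.
    keep_prob: probability of keeping a sample's path; 1.0 disables dropping.
    """
    if keep_prob >= 1.0:
        return [list(sample) for sample in batch]
    out = []
    for sample in batch:
        keep = rng.random() < keep_prob      # one decision per sample
        scale = 1.0 / keep_prob if keep else 0.0
        out.append([v * scale for v in sample])
    return out


rng = random.Random(0)
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

unchanged = drop_path(batch, keep_prob=1.0, rng=rng)
dropped = drop_path(batch, keep_prob=0.6, rng=rng)
print(unchanged)
print(dropped)
```

With a batch size as small as 10, dropping entire paths for some samples adds a lot of noise to each update, which is one plausible reason fine-tuning could stall with it enabled.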