I am working on very large dataset in Keras with a single-output neural network. Upon a change in depth of the network, I observed some improvements in the performance of the model. Therefore, I wanted to perform ""A systematic"" research-wise hyper-parameter optimization now (hidden layers, activation functions, # neurons, epochs, batch size, etc.). However, I was told that GridSearchCV and RandomSearchCV are not proper options since my dataset is large. I was wondering if any of you have experience in this regard or have feedback which may direct me to the right path.
use a confusion matrix and heat map to measure performance accuracy of your network
Y_pred=model.predict(X_test)
Y_pred2=np.argmax(Y_pred, axis=1)
Y_test2=np.argmax(Y_test, axis=1)
cm = confusion_matrix(Y_test2, Y_pred2)
sns.heatmap(cm)
plt.show()
print(classification_report(Y_test2, Y_pred2,target_names=label_names))
Related
The following content comes from Keras tutorial
This behavior has been introduced in TensorFlow 2.0, in order to enable layer.trainable = False to produce the most commonly expected behavior in the convnet fine-tuning use case.
Why we should freeze the layer when fine-tuning a convolutional neural network? Is it because some mechanisms in tensorflow keras or because of the algorithm of batch normalization? I run an experiment myself and I found that if trainable is not set to false the model tends to catastrophic forgetting what has been learned before and returns very large loss at first few epochs. What's the reason for that?
During training, varying batch statistics act as a regularization mechanism that can improve ability to generalize. This can help to minimize overfitting when training for a high number of iterations. Indeed, using a very large batch size can harm generalization as there is less variation in batch statistics, decreasing regularization.
When fine-tuning on a new dataset, batch statistics are likely to be very different if fine-tuning examples have different characteristics to examples in the original training dataset. Therefore, if batch normalization is not frozen, the network will learn new batch normalization parameters (gamma and beta in the batch normalization paper) that are different to what the other network paramaters have been optimised for during the original training. Relearning all the other network parameters is often undesirable during fine-tuning, either due to the required training time or small size of the fine-tuning dataset. Freezing batch normalization avoids this issue.
I have designed convolutional neural network(tf. Keras) which has few parallel convolutional units with different kernal sizes. Then, each output results of that convolution layers are fed into another convolutional units which are in parallel. Then all the outputs are concatenated. Next flattening is done. After that I added fully connected layer and connected to the final softmax layer for multi class classification. I trained it and had good results in validation test.
However I remove the fully connected layer and accuracy was higher than the previous.
Please someone can explain, how does it happen, it will be very helpful.
Thank you for your valuable time.
Parameters as follows.
When you remove a layer, your model will have less chance of over-fitting the training set. Consequently, by making the network shallower, you make your model more robust to unknown examples and the validation accuracy increases.
Since your training accuracy is also increasing, it can be an indication that -
Exploding or vanishing gradients. You can try solving this problem using careful weight initialization, proper regularization, adding shortcuts, or gradient clipping.
You are not training for enough epochs to learn a deeper network. You can try few more epochs.
You do not have enough data to train a deeper network.
Eliminating the dense layer reduces the tendency for over fitting. Therefore your validation loss should improve as long as your model's training accuracy remains high. Alternatively you can add an additional dropout layer. You can also reduce the tendency for over fitting by using regularizers. Documentation for that is here.
I am working on a classification task which uses byte sequences as samples. A byte sequence can be normalized as input to neural networks by applying x/255 to each byte x. In this way, I trained a simple MLP and the accuracy is about 80%. Then I trained an autoencoder using 'mse' loss on the whole data to see if it works well for the task. I freezed the weights of the encoder's layers and add a softmax dense layer to it for classification. I retrained the new model (only trained the last layer) and to my surprise, the result was much worse than the MLP, merely 60% accuracy.
Can't the autoencoder learn good features from all the data? Why the result is so bad?
Possible actions to take:
Check the error of autoencoder, could it really predict itself?
Visualize the autoencoder results (dimensionality reduction), is the variance explained with fewer dimensions?
Making model more complex does not necessarily outperform simpler ones, did you plot the validation mse versus epoch? Is there a global minimum after a number of steps?
Do you have enough epochs?
What is the number of units you have in your autoencoder? It may be too less (or too much, in case of underfitting) depending on the behavior of your data and its volume.
Did you make any comparison with other dimensionality reduction methods like PCA, NMF?
Last but not least, is it the best way to engineer your features with autoencoder for this task?
"Why the result is so bad?" This is not actually a surprise. You've trained one model to be good at compressing the information. The transformations it learns at each layer do not need to be good for any other type of task at all. In fact, it could be throwing away a lot of information that is perfectly helpful for whatever auxiliary classification task you have, but which is not needed for a task purely of compressing and reconstructing the sequence.
Instead of approaching it by training a separate autoencoder, you might have better luck with just adding sparsity penalty terms from the MLP layers into the loss function, or use some other types of regularization like dropout. Finally you could consider more advanced network architectures, like ResNet / ODE layers or Inception layers, modified for a 1D sequence.
This is more of a deep learning conceptual problem, and if this is not the right platform I'll take it elsewhere.
I'm trying to use a Keras LSTM sequential model to learn sequences of text and map them to a numeric value (a regression problem).
The thing is, the learning always converges too fast on high loss (both training and testing). I've tried all possible hyperparameters, and I have a feeling it's a local minima issue that causes the model's high bias.
My questions are basically :
How to initialize weights and bias given this problem?
Which optimizer to use?
How deep I should extend the network (I'm afraid that if I use a very deep network, the training time will be unbearable and the model variance will grow)
Should I add more training data?
Input and output are normalized with minmax.
I am using SGD with momentum, currently 3 LSTM layers (126,256,128) and 2 dense layers (200 and 1 output neuron)
I have printed the weights after few epochs and noticed that many weights
are zero and the rest are basically have the value of 1 (or very close to it).
Here are some plots from tensorboard :
Faster convergence with a very high loss could possibly mean you are facing an exploding gradients problem. Try to use a much lower learning rate like 1e-5 or 1e-6. You can also try techniques like gradient clipping to limit your gradients in case of high learning rates.
Answer 1
Another reason could be initialization of weights, try the below 3 methods:
Method described in this paper https://arxiv.org/abs/1502.01852
Xavier initialization
Random initialization
For many cases 1st initialization method works the best.
Answer 2
You can try different optimizers like
Momentum optimizer
SGD or Gradient descent
Adam optimizer
The choice of your optimizer should be based on the choice of your loss function. For example: for a logistic regression problem with MSE as a loss function, gradient based optimizers will not converge.
Answer 3
How deep or wide your network should be is again fully dependent on which type of network you are using and what the problem is.
As you said you are using a sequential model using LSTM, to learn sequence on text. No doubt your choice of model is good for this problem you can also try 4-5 LSTMs.
Answer 4
If your gradients are going either 0 or infinite, it is called vanishing gradients or it simply means early convergence, try gradient clipping with proper learning rate and the first weight initialization technique.
I am sure this will definitely solve your problem.
Consider reducing your batch_size.
With large batch_size, it could be that your gradient at some point couldn't find any more variation in your data's stochasticity and for that reason it convergences earlier.
I am using TensorFlow for training model which has 1 output for the 4 inputs. The problem is of regression.
I found that when I use RandomForest to train the model, it quickly converges and also runs well on the test data. But when I use a simple Neural network for the same problem, the loss(Random square error) does not converge. It gets stuck on a particular value.
I tried increasing/decreasing number of hidden layers, increasing/decreasing learning rate. I also tried multiple optimizers and tried to train the model on both normalized and non-normalized data.
I am new to this field but the literature that I have read so far vehemently asserts that the neural network should marginally and categorically work better than the random forest.
What could be the reason behind non-convergence of the model in this case?
If your model is not converging it means that the optimizer is stuck in a local minima in your loss function.
I don't know what optimizer you are using but try increasing the momentum or even the learning rate slightly.
Another strategy employed often is the learning rate decay, which reduces your learning rate by a factor every several epochs. This can also help you not get stuck in a local minima early in the training phase, while achieving maximum accuracy towards the end of training.
Otherwise you could try selecting an adaptive optimizer (adam, adagrad, adadelta, etc) that take care of the hyperparameter selection for you.
This is a very good post comparing different optimization techniques.
Deep Neural Networks need a significant number of data to perform adequately. Be sure you have lots of training data or your model will overfit.
A useful rule for beginning training models, is not to begin with the more complex methods, for example, a Linear model, which you will be able to understand and debug more easily.
In case you continue with the current methods, some ideas:
Check the initial weight values (init them with a normal distribution)
As a previous poster said, diminish the learning rate
Do some additional checking on the data, check for NAN and outliers, the current models could be more sensitive to noise. Remember, garbage in, garbage out.