I implemented a simple neural network in Python for image classification (one class). The layer sizes are simple: (image_matrix, 5, 1). I am using ReLU and sigmoid for the hidden layers.
I am iterating 5000 times. At first it looks like the cost goes down gradually in a sensible way.
However, no matter how many training examples I use, or what my learning_rate is, the cost starts behaving erratically after around 3000 iterations every time...
(Image: plot of the cost per iteration)
Can someone help me understand what's going on?
Thanks
When training models, remember that the cost function can have multiple local minima. Your graph suggests that the cost is moving around a local minimum while the optimizer searches for the global minimum, which is what you are after when looking for the best performance of a model.
1st - you should probably track accuracy, F1-score, or loss per iteration/epoch to check whether the performance is actually improving.
2nd - do cross-validation and check the same metrics on the validation data.
3rd - implement an early stopping function that checks whether your model is still improving or not.
*Note: tune alpha (the learning rate) to help the optimizer get closer to the global minimum.
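The early stopping check from the third point can be sketched in a few lines. This is a minimal, framework-agnostic sketch, not the asker's code: the class name EarlyStopping, the parameters patience and min_delta, and the list of validation losses are all assumptions made for illustration.

```python
class EarlyStopping:
    """Signal a stop when the validation loss has not improved for `patience` checks."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # checks to wait after the last improvement
        self.min_delta = min_delta    # smallest decrease that counts as an improvement
        self.best_loss = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # New best value: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Hypothetical validation losses: improving at first, then erratic.
val_losses = [0.69, 0.55, 0.48, 0.45, 0.47, 0.52, 0.46, 0.50, 0.49]

stopper = EarlyStopping(patience=3)
stopped_at = None
for iteration, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = iteration
        break
```

In a real training loop you would call should_stop once per epoch with the validation loss, and keep a copy of the weights from the epoch with the best loss.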
I implemented a CNN with 3 convolutional layers, with max pooling and dropout after each layer.
I noticed that when I trained the model for the first time it gave me 88% testing accuracy, but after retraining it a second time, with the same training dataset, it gave me 92% testing accuracy.
I could not understand this behavior. Is it possible that the model overfitted during the second training process?
Thank you in advance for any help!
That is quite possible if you have not set a random seed, with set.seed( ) in R or tf.random.set_seed(any_number) in Python.
Well, I am no expert when it comes to machine learning, but I do know the math behind it. When you train a neural network, you are basically searching for a local minimum of the loss function. This means that the end result can depend heavily on the initial guess for all of the internal variables.
The variables are usually randomized as an initial estimate, so you can reach quite different results when running the training process multiple times.
That being said, when I studied the subject I was told that you usually reach similar results regardless of the initial guess of the parameters. However, it is hard to say whether 0.88 and 0.92 would be considered similar or not.
Hope this gives a somewhat possible answer to your question.
As mentioned in another answer, you could remove the randomness, both in the initialization of the parameters and in the shuffling of the data used for each epoch of training, by introducing a seed. This would ensure that when you run it twice, everything gets "randomized" in the exact same order. In TensorFlow this is done using, for example, tf.random.set_seed(1); the number 1 can be changed to any number to get a new seed.
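The effect of a seed can be illustrated without any framework. The sketch below uses Python's standard random module as a stand-in for the framework's random number generator; init_weights is a hypothetical helper, not a TensorFlow function.

```python
import random


def init_weights(n_weights, seed=None):
    """Draw initial weights from a standard normal distribution."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(n_weights)]


run_a = init_weights(4, seed=1)
run_b = init_weights(4, seed=1)
run_c = init_weights(4, seed=2)

same_seed_matches = run_a == run_b    # same seed: identical starting point
other_seed_differs = run_a != run_c   # new seed: different starting point
```

Two runs that start from the same weights and see the data in the same order follow the same optimization path, which is why the seed removes the run-to-run difference (as long as the rest of the pipeline is deterministic too).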
I'm implementing a residual CNN (a modified, smaller version of Xception) in a low-latency environment. I've done a lot of manual tuning to minimize the run time of my network (reducing the number of filters, removing layers, etc.).
But now I want to try allowing my network to make its classification prediction (final FCNN layer) on the residual connection after each residual block.
basic logic-
attempt final prediction with residual connection as input
if this fcnn layer predicts a certain class with a probability > a set threshold:
return fcnn output as if it was normal final layer
else:
do next residual block like normal and try the previous conditional again unless we are already at final block
My hope is this will allow my network to learn to solve easier problems with less computation while allowing it to still do the additional layers if it is still unsure of the classification.
So my basic question is: in PyTorch, what's the best way to implement this conditional so that my network can decide at run time whether to do more processing or not?
Currently I've tested returning the intermediate x's after the blocks in the forward function, but I don't know how best to set up the conditional to choose which x to return.
Also side note: I believe I may end up needing another cnn layer between the residual and fcnn to serve as a function to convert the internal representation for processing to a representation the fcnn understands for classification.
This has already been done and was presented at ICLR 2018.
It appears that in ResNets the first few bottlenecks learn representations (and therefore cannot be skipped), while the remaining bottlenecks refine the features and can therefore be skipped at a moderate loss of accuracy. (Stanisław Jastrzebski, Devansh Arpit, Nicolas Ballas, Vikas Verma, Tong Che, Yoshua Bengio, "Residual Connections Encourage Iterative Inference", ICLR 2018.)
This idea was taken to the extreme, with weights shared across bottlenecks, in Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, Jan Kautz, "IamNN: Iterative and Adaptive Mobile Neural Network for efficient image classification", ICLR 2018.
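The control flow described in the question (try a prediction after each block, stop once one class is confident enough) can be sketched as a plain loop. This is a framework-agnostic sketch in pure Python, not the method of the papers above: early_exit_forward, blocks, heads and threshold are hypothetical names, and the toy blocks and heads only exist to make the example runnable.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_forward(x, blocks, heads, threshold=0.9):
    """Run blocks in order and return as soon as a head is confident enough.

    blocks[i] maps features to features; heads[i] maps features to class logits.
    Returns (probabilities, number_of_blocks_used). After the last block the
    prediction is returned regardless of its confidence.
    """
    probs, used = None, 0
    for block, head in zip(blocks, heads):
        x = block(x)
        used += 1
        probs = softmax(head(x))
        if max(probs) > threshold:
            break  # confident enough: skip the remaining blocks
    return probs, used


def double(features):
    """Toy stand-in for a residual block: sharpens the features."""
    return [2.0 * v for v in features]


def identity_head(features):
    """Toy stand-in for a classifier head: uses the features as logits."""
    return features


blocks = [double, double, double]
heads = [identity_head, identity_head, identity_head]

_, easy_used = early_exit_forward([2.0, 0.0], blocks, heads)  # exits after 1 block
_, hard_used = early_exit_forward([0.1, 0.0], blocks, heads)  # needs all 3 blocks
```

In PyTorch the same loop can live inside the module's forward method, because the computation graph is built dynamically on every call. Note that the comparison needs a single Python boolean, so this form only works for one sample at a time; a batch would need per-sample handling.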
I was just wondering if anyone else has encountered this behavior before. I couldn't find this specific issue in my searches. I wrote my own (very simple) neural net using just numpy, and it appears to work: the cost function decreases as I iterate. However, when I change the random initialization of the weights from np.random.randn(shape) to np.random.randn(shape)*0.01 (I heard that smaller initial weights might speed up learning because I'm using a sigmoid layer), my cost function starts at 0.69 ≈ ln(2) and pretty much gets stuck there. This happens no matter how many times I restart the neural net, and no matter what kind of inputs I put into it. I find this very odd. I should add that if I start without the 0.01 multiplication factor, the cost function does decrease to less than 0.69.
The neural net uses a cross-entropy cost function, and uses gradient descent to make steps. No regularization has been implemented. This behavior doesn't seem to depend on what the neural network's dimensions (number of layers, neurons per layer) are, or the learning rate, only on whether I initialize the starting weights with or without multiplying them by 0.01.
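The starting value of about 0.69 can at least be reproduced: with weights this small, the sigmoid output is close to 0.5 for every input, and the cross-entropy of a 0.5 prediction is -ln(0.5) = ln(2). The sketch below checks this on a tiny one-hidden-layer network in pure Python. The network size, the random inputs and the balanced labels are assumptions for illustration; it does not reproduce the asker's code and does not by itself explain why training stays stuck.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def initial_cost(scale, n_inputs=10, n_hidden=5, n_examples=200, seed=0):
    """Mean cross-entropy of an untrained one-hidden-layer sigmoid network."""
    rng = random.Random(seed)
    # Weights drawn from a standard normal and multiplied by `scale`.
    w1 = [[scale * rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    w2 = [scale * rng.gauss(0, 1) for _ in range(n_hidden)]
    total = 0.0
    for i in range(n_examples):
        x = [rng.gauss(0, 1) for _ in range(n_inputs)]  # hypothetical input
        y = i % 2                                       # hypothetical balanced labels
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w1]
        a = sigmoid(sum(w * h for w, h in zip(w2, hidden)))
        total += -(y * math.log(a) + (1 - y) * math.log(1 - a))
    return total / n_examples


small_init_cost = initial_cost(scale=0.01)
reference = math.log(2)  # cross-entropy of a constant 0.5 prediction
```

Here small_init_cost lands within a small margin of reference, which matches the value the cost starts at in the question.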
I am using TensorFlow to train a model that has 1 output for 4 inputs. It is a regression problem.
I found that when I use RandomForest to train the model, it quickly converges and also performs well on the test data. But when I use a simple neural network for the same problem, the loss (Random square error) does not converge. It gets stuck on a particular value.
I tried increasing/decreasing number of hidden layers, increasing/decreasing learning rate. I also tried multiple optimizers and tried to train the model on both normalized and non-normalized data.
I am new to this field but the literature that I have read so far vehemently asserts that the neural network should marginally and categorically work better than the random forest.
What could be the reason behind non-convergence of the model in this case?
If your model is not converging, it may mean that the optimizer is stuck in a local minimum of your loss function.
I don't know what optimizer you are using but try increasing the momentum or even the learning rate slightly.
Another strategy that is often employed is learning rate decay, which reduces your learning rate by a factor every several epochs. This can also help you avoid getting stuck in a local minimum early in the training phase, while achieving maximum accuracy towards the end of training.
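A step-wise version of this decay is a one-line formula. The sketch below is a minimal illustration; step_decay, drop_factor and epochs_per_drop are hypothetical names and the values are arbitrary.

```python
def step_decay(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=10):
    """Learning rate at `epoch`, multiplied by `drop_factor` every `epochs_per_drop` epochs."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)


# Learning rate at a few epochs, starting from 0.1 and halving every 10 epochs.
schedule = {epoch: step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)}
```

With these settings the rate stays at 0.1 for epochs 0 to 9, drops to 0.05 at epoch 10, and is 0.025 by epoch 25.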
Otherwise, you could try an adaptive optimizer (Adam, Adagrad, Adadelta, etc.) that adapts the learning rate for you during training.
This is a very good post comparing different optimization techniques.
Deep neural networks need a significant amount of data to perform adequately. Be sure you have lots of training data, or your model will overfit.
A useful rule when starting out is to begin not with the more complex methods but with a simpler one, for example a linear model, which you will be able to understand and debug more easily.
In case you continue with the current methods, some ideas:
Check the initial weight values (initialize them from a normal distribution)
As a previous poster said, diminish the learning rate
Do some additional checking on the data: check for NaN values and outliers, since the current models could be more sensitive to noise. Remember, garbage in, garbage out.
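The data check from the last point can be sketched as follows. This is a minimal sketch in pure Python, assuming one numeric column held in a list; check_column and the z-score threshold of 3 are illustrative choices, not a standard.

```python
import math
import statistics


def check_column(values, z_threshold=3.0):
    """Return the indices of NaN entries and of entries far from the mean."""
    nan_idx = [i for i, v in enumerate(values) if math.isnan(v)]
    clean = [v for v in values if not math.isnan(v)]
    mean = statistics.mean(clean)
    std = statistics.pstdev(clean)
    outlier_idx = [
        i for i, v in enumerate(values)
        if not math.isnan(v) and std > 0 and abs(v - mean) / std > z_threshold
    ]
    return nan_idx, outlier_idx


# Hypothetical feature column with one missing value and one extreme value.
column = [1.0, 1.2, 0.9, float("nan"), 1.1, 50.0] + [1.0] * 14
nan_idx, outlier_idx = check_column(column)
```

Here nan_idx points at the missing entry (index 3) and outlier_idx at the extreme value (index 5). What to do with flagged rows (drop, clip, impute) depends on the data.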
After fixing my code and preparing my data for training, I find myself facing two questions.
Background:
My data has a date (one entry per minute) in the first column and congestion (a value between 0 and 200) in the second. My goal is to feed it to my neural network so that I can predict the congestion at each minute of the next week (my dataset has more than 10M entries, so I shouldn't lack data for training).
Problem:
I now have two questions. The first is about the loss, optimizer and activation. There seem to be quite a few of each, and each one has a domain where it is better than the others. Which would you recommend for this project? (Currently, in my tests, I use Adam as the optimizer, mean_square as the loss and linear as the activation.)
My second question is more about an error that I get (it may be linked to me using the wrong loss/optimizer). When running my code (with 10,000 training samples for now) I get an accuracy of 0, a low loss (0.00X) and a bad prediction (not even close to reality). Do you have any idea where this could come from?
What you are trying to do is called time series prediction (given data at times t-n, t-(n-1), ..., t-1, predict the state at time t) and is generally a task for a recurrent neural network. Here is the great blog post by Andrej Karpathy about the topic that you should have a look at.
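Framing the data this way means cutting the series into windows of n past values, each paired with the value that follows. The sketch below shows that step in pure Python; make_windows, n_steps and the sample readings are hypothetical.

```python
def make_windows(series, n_steps):
    """Split a series into (inputs, target) pairs: n_steps past values -> next value."""
    inputs, targets = [], []
    for t in range(n_steps, len(series)):
        inputs.append(series[t - n_steps:t])  # values at t-n, ..., t-1
        targets.append(series[t])             # value at t
    return inputs, targets


congestion = [10, 12, 15, 20, 18, 14]  # hypothetical per-minute readings
X, y = make_windows(congestion, n_steps=3)
# X == [[10, 12, 15], [12, 15, 20], [15, 20, 18]]
# y == [20, 18, 14]
```

Each row of X is then one input sequence for the network and the matching entry of y is its training target.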
About your two questions:
This is hard to answer, since the choice of optimizer depends highly on the input data. Generally speaking, the network will converge no matter what optimizer you use; the time it takes to converge will differ, however. Adaptive learning-rate methods like Adagrad, Adadelta, and Adam tend to achieve convergence slightly faster. Here is a good write-up of the different optimizers.
Basic neural networks (MLPs) don't do well with time series prediction. That would be an explanation for the low accuracy. However, I don't know why the loss would be close to 0.