I want to build my first neural net for recognizing hand-written digits in Python, but I can't find a good, simple termination rule.
What I mean by "termination rule" is when to stop updating my weights and biases, or how to know I've reached a local minimum.
Let me be clear: I'm not looking for the best-performing, most advanced, most sophisticated rule. On the contrary, I want the simplest, easiest-to-implement, good-enough-to-get-started rule that will just get the job reasonably done.
If any more information is required for you to answer, please do tell and I'll add it here.
Though the question is somewhat broad, I'll try to provide you with some general guidance.
Neural network training is the process of optimizing a high-dimensional, (almost always) non-convex loss function. As a result, it's very rare to have formal proofs about its global or local minima or its convergence speed. There are merely observations, for instance that all local minima are approximately equivalent in terms of test accuracy (loss), which makes the learning process easier because finding the global minimum is no longer mandatory.
The "termination rule" you're asking about is in the same category: it's a general rule that seems to work in most cases. When you're doing cross-validation, you should stop training when the validation accuracy (loss) stops improving and goes flat or gets worse for some period of time. The resulting model can reasonably be selected as the best one seen over the whole training process. One can also apply early stopping (see this and this question) to save training time and still avoid overfitting. Essentially, in practice, researchers let the network train as long as the time limit allows and increase the number of epochs only if the accuracy (loss) still does not look flat, which is rare.
For instance, on the chart below, 10 epochs is too early to stop, because there's a lot of potential for improvement. It's still unclear after 15 epochs. It's OK to stop after 20 epochs if there's a lack of time, but I'd let it run until epoch 25 to be sure. At that point, the training score is almost 1.0 and the validation score is flat, i.e., there's no sign it could improve further.
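To make that rule concrete, here is a minimal, framework-agnostic sketch of such a termination rule, assuming you can compute a validation loss once per epoch; model, train_one_epoch, and validation_loss are placeholders for your own code, and the patience of 5 and cap of 100 epochs are arbitrary starting points.

import copy

max_epochs = 100                  # arbitrary upper bound on training length
patience = 5                      # epochs we tolerate without improvement (arbitrary)
best_loss = float("inf")
best_model = None
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch(model)                      # placeholder: one pass over the training data
    val_loss = validation_loss(model)           # placeholder: loss on held-out data

    if val_loss < best_loss:
        best_loss = val_loss
        best_model = copy.deepcopy(model)       # remember the best model seen so far
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= patience:  # validation loss has gone flat or worse
        break                                   # terminate training

model = best_model                              # keep the best model, not the last one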
Related
I implemented a CNN with 3 convolutional layers, with max-pooling and dropout after each layer.
I noticed that when I trained the model for the first time it gave me 88% testing accuracy, but after retraining it a second time in succession, on the same training dataset, it gave me 92% testing accuracy.
I can't understand this behavior. Is it possible that the model overfit during the second training run?
Thank you in advance for any help!
It is quite possible if you have not set a random seed, e.g. set.seed() in the R language or tf.random.set_seed(any_number) in Python.
Well, I am no expert when it comes to machine learning, but I do know the math behind it. When you train a neural network, you basically find a local minimum of the loss function. This means that the end result will depend heavily on the initial guess for all of the internal variables.
Usually the variables are randomized as an initial estimate, so you can reach quite different results from running the training process multiple times.
That being said, when I studied the subject I was told that you usually reach similar results regardless of the initial guess of the parameters. However, it is hard to say whether 0.88 and 0.92 would be considered similar or not.
Hope this gives a somewhat possible answer to your question.
As mentioned in another answer, you could remove the randomization, both in the parameter initialization and in the shuffling of the data used for each training epoch, by introducing a seed. This ensures that when you run it twice, everything gets "randomized" in exactly the same order. In TensorFlow this is done with, for example, tf.random.set_seed(1); the number 1 can be changed to any other number to get a new seed.
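For example, in Python the usual trio of seeds looks like this (42 is an arbitrary choice, and note that on a GPU you may need extra settings to get bit-for-bit identical runs):

import random
import numpy as np
import tensorflow as tf

SEED = 42                  # any fixed integer
random.seed(SEED)          # Python's built-in RNG
np.random.seed(SEED)       # NumPy, used by many libraries for weight init and shuffling
tf.random.set_seed(SEED)   # TensorFlow ops such as weight init and dropout masks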
I am pretty new to doc2vec, so I did a little research and found a couple of things. Here is my story: I am trying to train doc2vec on 2.4 million documents. At first, I tried it with a small model of only 12 documents. I checked the results with infer_vector on the first document and found it to be similar indeed to the first document, with a cosine similarity of 0.97-0.99, which I found good, even though when I entered a new document of completely different words I still received a high similarity score of about 0.8.
However, I put that aside and went on to build the full model with the 2.4 million documents. At this point, my problems began. The results were complete nonsense: most_similar returned results with a similarity of 0.4-0.5 that were completely different from the new document being checked. I tried to tune parameters, but with no results yet. I also tried to remove randomness from both the small and the big model, but I still got different vectors. Then I tried to use get_latest_training_loss on each epoch in order to see how the loss changes from epoch to epoch. This is my code:
model = Doc2Vec(vector_size=300, alpha=0.025, min_alpha=0.025,
                pretrained_emb=".../glove.840B.300D/glove.840B.300d.txt",
                seed=1, workers=1, compute_loss=True)
model.build_vocab(documents)
for epoch in range(10):
    for i in range(model_glove.epochs):
        model.train(documents, total_examples=token_count, epochs=1)
        training_loss = model.get_latest_training_loss()
        print("Training Loss: " + str(training_loss))
    model.alpha -= 0.002           # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
I know this code is a bit awkward, but it is used here only to follow the loss.
The error I receive is:
AttributeError: 'Doc2Vec' object has no attribute 'get_latest_training_loss'
I tried auto-complete on model. and found that indeed there is no such function. I found something with a similar name, training_loss, but it gives me the same error.
Can anyone here give me an idea?
Thanks in Advance
Especially as a beginner, there's no pressing need to monitor training-loss. For a long time, gensim didn't report it in any way for any models – and it was still possible to evaluate & tune models.
Even now, running-loss-reporting in gensim is kind of a rough, incomplete, advanced/experimental feature – and after a recent refactoring it doesn't seem to have full support in Doc2Vec. (Notably, while having the loss level reach a plateau can be a helpful indicator that further training can't help, it is most definitely not the case that a model with arbitrarily-lower loss is better than others. In particular, a model that achieves near-zero loss would likely be extremely overfit, and probably of little use for downstream applications.)
Regarding your general aim, of getting good vectors, with regard to the process you've described/shown:
Tiny tests (as with your 12 documents) don't really work with these algorithms, except to check that you're calling the steps with legal parameters. You shouldn't expect the similarities in such toy-sized tests to mean anything, even if they superficially meet expectations in some cases. The algorithms need lots of training data & large vocabularies to train sensible models. (So, your full 2.4 million docs should work well.)
You generally shouldn't be changing the default alpha/min_alpha values, or calling train() multiple times in a loop. You can just leave those at their defaults and call train() with your desired number of training epochs – it will do the right thing (see the sketch after these points). The approach in your shown code is a suboptimal and fragile anti-pattern – whichever online source you learned it from is misguided and severely outdated.
You haven't shown your inference code, but note that it will re-use the epochs, alpha, and min_alpha cached in the model instance from the original initialization, unless you supply other values. And the default epochs, if not specified, is a value of just 5 inherited from shared code with Word2Vec. Doing a mere 5 epochs, and leaving the effective alpha at 0.025 the whole time (which is what alpha=0.025, min_alpha=0.025 does to inference), is unlikely to give good results, especially on short docs. Common epochs values from published work are 10-20 – and doing at least as many for inference as were used for training is typical.
You are showing the use of a pretrained_emb initialization parameter that is not part of the standard gensim library, so perhaps you're using some other fork based on some older version of gensim. Note that it's not typical to initialize a Doc2Vec model with word-embeddings from elsewhere before training, so if you're doing that, you're already in advanced/experimental territory – which is premature if you're still trying to get some basic doc-vectors into reasonable shape. (And usually people seek tricks like re-used word-vectors when they have a small corpus. With 2.4 million docs, you probably don't have such corpus problems – any word-vectors can be learned from your corpus along with the doc-vectors, in the default way.)
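To illustrate the simpler pattern described in these points, here is a hedged sketch (the epochs=20 value is just a common choice from published work, and depending on your gensim version the inference keyword may be steps rather than epochs):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# documents: an iterable of TaggedDocument(words=[...tokens...], tags=[doc_id])
model = Doc2Vec(vector_size=300, epochs=20, workers=4)   # leave alpha/min_alpha at their defaults
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# At inference time, be explicit rather than relying on values cached in the model.
new_vector = model.infer_vector(["some", "tokenized", "document"], epochs=20)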
I was trying to implement a paper I read. Basically, it uses 3 neural network classifiers with different parameters to work on the same loan-default data with 9 different training-to-testing ratios.
To find the best parameters, we use the following criterion: when (1) max_iteration = 25000 and (2) the loss value is less than 0.008, we measure the accuracy and pick the best.
However, when I try to use Python's sklearn.neural_network.MLPClassifier to do this, I run into a problem: as the training-to-testing ratio increases, the number of iterations the program runs drops dramatically, while the loss value increases.
Classifier Performance Table.
This is clearly not what I want; the iteration count should keep rising to 25000 before stopping.
This is how I defined the classifiers:
clf1 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False, alpha=1e-5,
                     max_iter=25000, hidden_layer_sizes=(18,), momentum=0.7,
                     learning_rate_init=0.0081, tol=0, random_state=3)
clf2 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False, alpha=1e-5,
                     max_iter=25000, hidden_layer_sizes=(23,), momentum=0.69,
                     learning_rate_init=0.0095, tol=0, random_state=3)
clf3 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False, alpha=1e-5,
                     max_iter=25000, hidden_layer_sizes=(27,), momentum=0.79,
                     learning_rate_init=0.0075, tol=0, random_state=3)
As you can see, I already set tol=0, so that any decrease in the loss, however small, should count as progress and training should keep iterating. I have also tried other values, but the iteration count is still far smaller than I expected.
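One thing I have been wondering (just a guess, I have not verified that this is the cause) is whether other stopping-related parameters matter here; MLPClassifier also has an n_iter_no_change parameter that controls how many iterations without sufficient improvement are allowed before training is considered converged, and widening it for clf1 would look like this:

from sklearn.neural_network import MLPClassifier

# Same settings as clf1 above, with the no-improvement window widened so the
# early convergence check is far less likely to fire before max_iter is reached.
clf1 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False,
                     alpha=1e-5, max_iter=25000, hidden_layer_sizes=(18,),
                     momentum=0.7, learning_rate_init=0.0081, tol=0,
                     n_iter_no_change=25000, random_state=3)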
Hope someone can help me, thanks!
I implemented a simple neural network for classification (one class) of images in Python. The layers are simple (image_matrix, 5, 1), using ReLU and sigmoid for the hidden layers.
I am iterating 5000 times. At first it looks like the cost goes down gradually in a sensible way.
However, no matter how many training examples I use or what my learning_rate is, the cost starts behaving erratically after around 3000 iterations every time...
cost-vs-iteration plot (image)
Can someone help me understand what's going on?
Thanks
When training models, you should remember that there are multiple local minima of the cost function. Your graph shows that your cost is bouncing around such a local minimum while you are searching for the global minimum, which is the goal when you are after the best performance for a model.
1st - check accuracy, F1-score, or loss per iteration/epoch to see whether the performance is actually improving.
2nd - do cross-validation and check the same metrics above on the validation set.
3rd - implement an early-stopping function that checks whether your model is still improving or not.
*note: find the best alpha (learning rate), which would help you approach the global minimum better.
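On that note about alpha: a cost curve that turns erratic late in training is often a sign that the learning rate stays too large, so a simple decay schedule is a common first thing to try. A minimal sketch (the constants are illustrative, not taken from your code):

def step_decay(initial_alpha, epoch, drop=0.5, epochs_per_drop=1000):
    """Halve the learning rate every epochs_per_drop epochs."""
    return initial_alpha * (drop ** (epoch // epochs_per_drop))

# Inside your training loop, e.g. for 5000 iterations:
# for epoch in range(5000):
#     alpha = step_decay(0.01, epoch)
#     ... update the weights using this alpha ...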
After fixing my code and preparing my data for training, I have found myself with two questions.
Background:
I have data consisting of a date (one entry per minute) in the first column and congestion (a value between 0 and 200) in the second. My goal is to feed this to my neural network so that it can predict the congestion at each minute for the next week (my dataset has more than 10M entries, so I shouldn't have a problem with lack of training data).
Problem:
I now have two questions. The first is about the loss, the optimizer, and the activation: there seem to be quite a few of them, each with a domain where it does better than the others. Which ones would you recommend for this project? (Currently, in my tests, I use Adam as the optimizer, mean squared error as the loss, and a linear activation.)
My second question is more like an error I'm getting (it may be linked to me using the wrong loss/optimizer). When running my code (10,000 training samples for now), I get an accuracy of 0, a low loss (0.00X), and bad predictions (not even close to reality). Do you have any idea where this could come from?
What you are trying to do is called time series prediction (given data at times t-n, t-(n-1), ..., t-1, predict the state at time t) and is generally a task for a recurrent neural network. Here is the great blog post by Andrej Karpathy about the topic that you should have a look at.
About your two questions:
This is hard to answer, since the question of which optimizer to use depends highly on the input data. Generally speaking, the network will converge no matter what optimizer you use; the time it takes to converge will differ, however. Adaptive learning-rate methods, like Adagrad, Adadelta, and Adam, tend to achieve convergence slightly faster. Here is a good write-up of the different optimizers.
Basic neural networks (MLPs) don't do well with time series prediction; that would be an explanation for the low accuracy. However, I don't know why the loss would be 0.
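If you want to try the recurrent route, a minimal Keras sketch for one-step-ahead forecasting could look like the following; the 60-minute window and layer sizes are arbitrary choices, and you would need to reshape your series into (samples, window, 1) arrays and scale the congestion values yourself. Note that this is set up as regression (MSE loss), so accuracy is not a meaningful metric for it.

import tensorflow as tf

WINDOW = 60   # predict the next minute from the previous 60 minutes (arbitrary choice)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(WINDOW, 1)),
    tf.keras.layers.Dense(1),                    # next-minute congestion value
])
model.compile(optimizer="adam", loss="mse")      # regression target, so MSE rather than accuracy

# X: array of shape (num_samples, WINDOW, 1); y: shape (num_samples, 1), both scaled to [0, 1]
# model.fit(X, y, epochs=10, batch_size=256, validation_split=0.1)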