I developed an ML model for my research, and it needs to be trained on a large amount of data; I have to train it for 100 epochs. But my MacBook (M2, 13") can't handle that, and I also need the laptop for studying, so I can't leave it training all day. What I want to know: if I split the training into 10 epochs per day with the same dataset over 10 days, will it give the same result as training for 100 epochs in one go?
I'm using YOLOv5.
It would be easier to answer if you gave us the libraries you're using for your project.
In general, though, you should look into training checkpoints: these let you train for a certain number of epochs, save the state, and then resume from the last checkpoint to train another X epochs later.
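Since YOLOv5 is built on PyTorch, a minimal sketch of the save/resume pattern could look like the following; the file name, dictionary keys, and tiny model are illustrative stand-ins, not YOLOv5's actual internals (YOLOv5's own train.py also provides a --resume flag for this):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# End of day 1: after training 10 epochs, save everything needed to continue
torch.save({
    "epoch": 10,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, "last_checkpoint.pt")

# Day 2: restore the weights and optimizer state, then keep training
ckpt = torch.load("last_checkpoint.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"]  # continue the training loop from epoch 10

One caveat: to come close to a single 100-epoch run, everything stateful (optimizer momentum, the position in the learning-rate schedule, and so on) needs to be saved and restored as well, which is what the checkpoint dictionary above is meant to capture.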
Related
So I have this ASR (speech recognition) model built with transformers: https://www.kaggle.com/code/bernardoolisan/speechrecognition-dot. The problem is that I was training it on a dataset of only 25 hours of audio; the ASR works, but it performs poorly on new data. That's because of the 25 hours: you should use at least 100 hours for training.
So I decided to use the LibriSpeech dataset, which provides 1000, 360, and 100 hours of training data. The thing is, when I train on LibriSpeech I get
loss:nan
[Picture of loss: nan during training]
Why am I getting loss: nan? It was working with the old dataset, but now it isn't, and I didn't change any parameters...
Any idea?
I'm using a tf.data dataset containing my training data, consisting of (let's say) 100k images.
I'm also using a tf.data dataset containing my validation set.
Since one epoch over all 100k images takes quite a long time (in my case approximately one hour) before I get any feedback on validation performance, I set the steps_per_epoch parameter in tf.keras.Model.fit() to 10000.
With a batch size of 1, this results in 10 validation scores by the time all 100k images have been processed.
In order to complete one pass over my entire training dataset of 100k images, I set the epochs parameter to 10.
However, I'm not sure if using steps_per_epoch and epochs this way has any other consequences. Is it correct to use these parameters in order to get more frequent feedback on performance?
And also a more specific question, does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
I already dug into the TensorFlow docs and read several different stack overflow questions, but I couldn't find anything conclusive to answer my own question. Hope you can help!
Tensorflow version I'm using is 2.2.0.
Is it correct to use these parameters in order to get more frequent feedback on performance?
Yes, it is correct to use these parameters. Here is the code that I used to fit the model:
model.fit(
train_data,
steps_per_epoch = train_samples//batch_size,
epochs = epochs,
validation_data = test_data,
verbose = 1,
validation_steps = test_samples//batch_size)
does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
It uses all the images in your training data: when fit() is given a tf.data dataset, the iterator is not reset between epochs, so each new 'epoch' continues from where the previous one left off instead of reusing the first 10k images.
For better understanding: the epochs parameter is the number of times the learning algorithm will work through the entire training dataset.
steps_per_epoch, on the other hand, is the total number of samples in your training dataset divided by the batch size.
For example, if you have 100000 training samples and use a batch size of 100, one epoch corresponds to 1000 steps.
Note: batch sizes are generally chosen as powers of 2, because optimized matrix-operation libraries work most efficiently with them.
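To make the relationship concrete, here is a minimal, self-contained sketch with a toy dataset; the model, numbers, and names are purely illustrative, scaled down from the 100k-image case:

import numpy as np
import tensorflow as tf

# Toy stand-in for the real training/validation data (1000 "images" of 8 features)
x = np.random.rand(1000, 8).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")
train_data = tf.data.Dataset.from_tensor_slices((x, y)).batch(1).repeat()
val_data = tf.data.Dataset.from_tensor_slices((x[:100], y[:100])).batch(1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# One full pass over the 1000 samples, split into 10 shorter "epochs"
# so validation feedback arrives 10 times instead of once.
model.fit(train_data,
          steps_per_epoch=100,   # 10 epochs * 100 steps = 1000 samples = one full pass
          epochs=10,
          validation_data=val_data)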
I'm building a Keras Sequential model to do binary image classification. When I train for around 70 to 80 epochs I start getting good validation accuracy (81%). But I was told that this is a very large number of epochs and that it would hurt the performance of the network.
My question is: is there a maximum number of epochs that I shouldn't exceed? Note that I have 2000 training images and 800 validation images.
If the number of epochs is very high, your model may overfit and your training accuracy will reach 100%. A common approach is to plot the error rate on the training and validation data: the horizontal axis is the number of epochs and the vertical axis is the error rate. You should stop training at the point where the validation error is at its minimum.
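As a minimal illustration of that plot, here is a self-contained sketch with toy data (loss is plotted rather than error rate, but accuracy or error works the same way; all names and numbers are placeholders):

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

# Toy data and model, just to produce a History object (purely illustrative)
x = np.random.rand(200, 8).astype("float32")
y = np.random.randint(0, 2, size=(200, 1)).astype("float32")
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation="relu"),
                             tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")
history = model.fit(x, y, epochs=50, validation_split=0.2, verbose=0)

# Plot training vs. validation loss; stop training roughly where the
# validation curve bottoms out and starts rising again.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()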
You need to strike a balance with your regularization parameters. The major problem in deep learning is overfitting the model. Various regularization techniques are used, such as:
i) Reducing the batch size
ii) Data augmentation (only if your data is not diverse)
iii) Batch normalization
iv) Reducing complexity in the architecture (mainly the convolutional layers)
v) Introducing dropout layers (only if you are using dense layers)
vi) Reducing the learning rate
vii) Transfer learning
The batch-size vs. epochs tradeoff is quite important. It also depends on your data and varies from application to application, so you have to experiment with your data a little to find the exact figures. Normally a batch size of 32 medium-sized images requires about 10 epochs for good feature extraction in the convolutional layers, but again, this is relative.
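For illustration, here is a minimal Keras sketch applying a few items from the list above, namely batch normalization, a dropout layer, and a reduced learning rate; the layer sizes, image shape, and values are placeholders, not tuned recommendations:

import tensorflow as tf

# A small CNN illustrating items (iii), (v) and (vi) from the list:
# batch normalization, dropout before the dense head, and a reduced
# learning rate in the optimizer.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(128, 128, 3)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])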
There's an EarlyStopping callback that Keras supplies, which you simply define:
EarlyStopping(patience=self.patience, verbose=self.verbose, monitor=self.monitor)
Let's say the epochs parameter equals 80, as you said before. When you use the EarlyStopping callback, that number of epochs becomes the maximum number of epochs.
You can define EarlyStopping to monitor the validation loss, for example: whenever this loss stops improving, it gives training a few last chances (the number you put in the patience parameter), and if the monitored value still hasn't improved after those chances, the training process stops.
The best practice, in my opinion, is to use both EarlyStopping and ModelCheckpoint, which is another callback in Keras' API that simply saves your best model so far (you decide what 'best' means: the best loss or whatever other value you evaluate your results with).
This is the Keras solution for the problem you're trying to deal with. In addition, there is a lot of online material you can read about how to deal with overfitting.
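Here is a minimal, self-contained sketch of wiring the two callbacks together; the tiny model, data, file name, and patience value are all illustrative stand-ins for your own setup:

import numpy as np
import tensorflow as tf

# Toy data/model so the snippet runs end to end (stand-ins for your own)
x = np.random.rand(500, 8).astype("float32")
y = np.random.randint(0, 2, size=(500, 1)).astype("float32")
model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                             tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

callbacks = [
    # Stop once val_loss hasn't improved for 5 consecutive epochs
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, verbose=1),
    # Keep the weights from the best epoch seen so far
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                       save_best_only=True, verbose=1),
]

# 80 becomes the *maximum* number of epochs; training may stop earlier
model.fit(x, y, epochs=80, validation_split=0.2, callbacks=callbacks)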
Yes! There is a solution for your problem: pick a large number of epochs, e.g. 1k or 2k, and just use early stopping on your neural net.
Early Stopping:
Keras supports early stopping of training via a callback called EarlyStopping.
This callback allows you to specify the performance measure to monitor and the trigger; once triggered, it stops the training process. For example, you can set a trigger that stops training if accuracy has not increased over the previous 5 epochs: Keras will watch the last 5 epochs through the callback and stop training if your accuracy is not improving.
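A one-line version of the trigger described above; this is a hedged sketch where the metric name assumes the model was compiled with metrics=['accuracy'] and fit() is given validation data:

from tensorflow.keras.callbacks import EarlyStopping

# Stop if validation accuracy has not increased over the previous 5 epochs
stop_early = EarlyStopping(monitor="val_accuracy", patience=5, mode="max", verbose=1)
# model.fit(..., epochs=2000, callbacks=[stop_early])   # 2000 is just an upper bound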
I have a custom dataset of approximately 20k images (10% used for validation).
Roughly 1/3 of the images are in label class 0, 1/3 in label class 1, and 1/3 contain neither class 0 nor class 1 objects and carry a -1 label.
I have run approximately 400 epochs. Over the last 40 epochs, validation mAP has increased from 0.817 to 0.831, and training cross-entropy loss has dropped from 0.377 to 0.356.
the last epoch had validation mAP <score>=(0.83138943309)
train cross_entropy <loss>=(0.356147519184)
train smooth_l1 <loss>=(0.150637295831)
The training loss still looks like it has a reasonable amount left to reduce, but I don't have any experience with ResNet (with YOLOv3 this dataset quickly went below 0.1).
Is my approach of having 1/3 of the training images contain neither class reasonable? When I was training YOLOv3 it seemed to help the network avoid false positives.
Is there any rule of thumb that helps me estimate how many epochs are appropriate based on the number of classes/images?
It's cost me about 100 bucks on AWS to get to this point, and I'm not sure whether it needs another 100 bucks or 1000 bucks to reach the optimal mAP. At the current rate, one hour of training gives about a 1% improvement, and I'd expect that to slow down.
Are there other metrics I should be looking at? (If so, how do I export them?)
Are there any hyperparameters I should change before resuming training?
My hyperparameters are:
base_network='resnet-50',
num_classes=2,
mini_batch_size=32,
epochs=200,
learning_rate=0.001,
lr_scheduler_step='3,6',
lr_scheduler_factor=0.1,
optimizer='sgd',
momentum=0.9,
weight_decay=0.0005,
overlap_threshold=0.5,
nms_threshold=0.45,
image_shape=416,
label_width=480,
num_training_samples=19732
thanks,
John
It's hard to say ahead of time for a custom dataset because you're dealing with many different variables. Tracking the validation mAP is of course a good way to tell when to stop, for example when the mAP stops increasing or starts leveling out.
Beyond that, I would recommend looking at others who used the same architecture and similar parameters to gain some insight. You mentioned a custom dataset, but for ImageNet, DAWNBench publishes that information. For example, this page lists the hyperparameters per epoch of a related setup for you to explore.
I would also urge you to look at fine-tuning pre-trained models to save money and computation. See the Vision section here and here, and https://github.com/apache/incubator-mxnet/issues/4616 for information on fine-tuning the FC layers.
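To illustrate the idea of fine-tuning only the final layers, here is a minimal sketch; note it uses Keras purely for brevity rather than the MXNet/SageMaker object detection setup from the question, and the input shape, class count, and learning rate are placeholders:

import tensorflow as tf

# Illustration only: freeze a pre-trained backbone and retrain a new head.
backbone = tf.keras.applications.ResNet50(weights="imagenet",
                                           include_top=False,
                                           input_shape=(224, 224, 3),
                                           pooling="avg")
backbone.trainable = False            # keep the pre-trained features fixed

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(2, activation="softmax"),   # new head for 2 classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) on your data, then optionally unfreeze some layers and fine-tune further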
As we know, Caffe supports resuming training when a snapshot is given. An explanation of Caffe's training continuation scheme can be found here. However, I found that the training loss and validation loss are inconsistent. I give the following example to illustrate my point. Suppose I am training a neural network with a maximum of 1000 iterations, and every 100 training iterations it keeps a snapshot. This is done using the following command:
caffe train -solver solver.prototxt
where the batch size is selected to be 64, and in solver.prototxt we have:
test_iter: 4
max_iter: 1000
snapshot: 100
display: 100
test_interval: 100
We select test_iter=4 carefully so that testing covers nearly all of the validation dataset (there are 284 validation samples, slightly more than 4*64 = 256).
This gives us a list of .caffemodel and .solverstate files. For example, we may have solver_iter_300.solverstate and solver_iter_300.caffemodel. When these two files are generated, we can also see the training loss (13.7466) and validation loss (2.9385).
Now, if we use the snapshot solver_iter_300.solverstate to continue training:
caffe train -solver solver.prototxt -snapshot solver_iter_300.solverstate
We can see that the training loss and validation loss are now 12.6 and 2.99 respectively, which are different from before. Any ideas? Thanks.
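For reference, the same continuation can also be driven from the Python interface, which can make it easier to poke at the restored state; this is a hedged sketch that assumes pycaffe is installed and that solver.prototxt and the snapshot file are in the working directory:

import caffe

caffe.set_mode_cpu()

# Equivalent of: caffe train -solver solver.prototxt -snapshot solver_iter_300.solverstate
solver = caffe.get_solver('solver.prototxt')
solver.restore('solver_iter_300.solverstate')   # reload weights, momentum and iteration count
solver.solve()                                   # continue training from iteration 300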