I am using os.system() to execute the first command mentioned here. It trains a fastText supervised model with autotune enabled. The problem is that the program continues to the next line before training completes and before the model file is saved. Both training and saving the model file happen within this single command.
So I want to prevent the program from moving to the next line until the command has finished executing. How can I achieve this?
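For reference, os.system() normally waits for the command to exit, so if the next line runs early, the command itself may be starting the work in the background. One way to make the wait explicit is subprocess.run, which blocks until the child process exits. The following is a minimal sketch: run_and_wait is a hypothetical helper name, and the placeholder command stands in for the real training command, which is not shown here.

```python
import subprocess
import sys


def run_and_wait(command):
    """Run a command and block until it exits; raise CalledProcessError on failure."""
    completed = subprocess.run(command, check=True)
    return completed.returncode


if __name__ == "__main__":
    # Placeholder command standing in for the real training call (an assumption).
    placeholder = [sys.executable, "-c", "print('training finished')"]
    exit_code = run_and_wait(placeholder)
    # This line is reached only after the child process has exited.
    print("command exited with code", exit_code)
```

With check=True a non-zero exit status raises an exception, so a failed training run does not silently fall through to code that expects the model file to exist.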
Related
I'm training my DQN and it often happens that I want to change a setting in the middle of the training. I know there is the option to terminate the running code via the terminal with CTRL+C, but I'd like to intervene only after the currently running epoch has finished. Is there a way to implement that? (I'm using VS Code.)
You could insert an input() call after every epoch and use that. But this means you have to sit in front of the computer for the whole run to keep the program going without long pauses.
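An alternative that does not block on input() is to check for a flag file at the end of each epoch. This is a different technique from the one suggested above, shown only as a sketch: STOP_FILE, train_one_epoch, and train are hypothetical names, and the epoch body is a placeholder for the real DQN training step.

```python
import os

STOP_FILE = "stop_training"  # hypothetical flag file name


def train_one_epoch(epoch):
    """Placeholder for the real training code of a single epoch."""
    return epoch


def train(max_epochs, stop_file=STOP_FILE):
    """Run up to max_epochs epochs, stopping early if the flag file appears."""
    completed = 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        completed += 1
        # The check happens only between epochs, so the running epoch always finishes.
        if os.path.exists(stop_file):
            os.remove(stop_file)  # consume the request so the next run starts clean
            break
    return completed
```

To request a stop, create an empty file with that name in the working directory (for example from a second terminal or the VS Code file explorer). Training then ends after the current epoch, and no one has to be at the keyboard in the meantime.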
So basically I wrote a keras generator to download images from a web server while it trains. This was an attempt to speed up training so that the training didn't have to wait for the entire batch of images to download before the training began.
To really speed this up I'd like to enable multiprocessing on the keras fit_generator function. However, sometimes keras attempts to download the same image multiple times at once. Not only is this inefficient, it also crashes the program as multiple processes attempt to write to the same file at once. This problem doesn't happen when multiprocessing is False, even with multiple workers. I assume this is due to the GIL.
Normally you could use locks to ensure that the same file is only written to once. However, I don't see how to do this using keras. If anyone could give me some pointers that'd be great. Thanks for reading.
Here's the code that crashes:
image_name = str(image['image'])
try:
    obj = self.client.get_object(Bucket=S3_BUCKET, Key=SRC_IMG_FOLDER + image_name)
    obj_image = Image.open(obj['Body'])
    if self.image_extension not in image_name:
        image_name += self.image_extension
    obj_image.save(self.image_path(image_index))
The saving of the file is where I run into issues.
Alright, I think I figured it out, in case anyone has a similar problem in the future. I simply use os.open(file_path, os.O_CREAT | os.O_EXCL), which atomically creates the file and fails if it already exists. I catch the FileExistsError and, in that process, wait for the file to finish downloading. This ensures that the same file cannot be downloaded twice.
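The idea can be sketched roughly as follows. This is not the poster's actual code: fetch_once, the download callable, and the ".lock" and ".part" suffixes are assumptions made for illustration. The sketch claims a separate lock file with O_CREAT | O_EXCL and moves the finished download into place with os.replace, so a waiting process never sees a half-written image.

```python
import os
import time


def fetch_once(path, download, poll_seconds=0.1, timeout=60.0):
    """Download to `path` at most once across processes.

    Returns True if this process did the download, False if another one did.
    `download` is a hypothetical callable that returns the file content as bytes.
    """
    lock_path = path + ".lock"
    try:
        # O_CREAT | O_EXCL fails atomically if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another process owns the download; wait for the finished file.
        deadline = time.monotonic() + timeout
        while not os.path.exists(path):
            if time.monotonic() > deadline:
                raise TimeoutError("timed out waiting for " + path)
            time.sleep(poll_seconds)
        return False
    os.close(fd)
    tmp_path = path + ".part"
    with open(tmp_path, "wb") as f:
        f.write(download())
    os.replace(tmp_path, path)  # the finished file appears under its final name in one step
    return True
```

The lock file is left in place on purpose, so later calls for the same image return immediately instead of downloading again.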
I recently started using Google Colab to train my CNN model. A single training run always takes 10+ hours. I cannot stay in the same place for those 10+ hours, so I always power off my notebook and let the process keep going.
My code saves models automatically. I found that after I disconnect from Colab, the process still keeps saving models.
Here are the questions:
When I try to reconnect to the Colab notebook, it always gets stuck at the "INITIALIZING" stage and can't connect. I'm sure that the process is running. How do I know when the process is over?
Is there any way to reconnect to the ongoing process? It would be nice to be able to observe the training losses during training.
Sorry for my poor English, thanks a lot.
Output your loss results to a log file saved in your drive, and periodically check this file.
You can run your training process like this (in a Colab cell, log_file is a Python variable that IPython substitutes into the shell command):
log_file = "/content/drive/My Drive/path/log.log"
!python train.py > "$log_file"
First question: restart the runtime from the Runtime menu.
Second question: I think you can use TensorBoard to monitor your training.
It seems there's no normal way to do this. But you can save your model to Google Drive with current training epoch number, so when you see something like "my_model_epoch_1000" on your google drive, you will know that the process is over.
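Checking for such a file can also be scripted. The sketch below assumes checkpoints are saved with names that start with "my_model_epoch_" followed by the epoch number, as in the example above. The helper names and the directory argument are hypothetical.

```python
import os
import re

# Matches names such as "my_model_epoch_1000" or "my_model_epoch_1000.h5".
CHECKPOINT_PATTERN = re.compile(r"^my_model_epoch_(\d+)")


def latest_saved_epoch(directory):
    """Return the highest epoch number found in checkpoint names, or None."""
    epochs = []
    for name in os.listdir(directory):
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            epochs.append(int(match.group(1)))
    return max(epochs) if epochs else None


def training_is_over(directory, final_epoch):
    """True once a checkpoint for the final epoch (or later) exists."""
    latest = latest_saved_epoch(directory)
    return latest is not None and latest >= final_epoch
```

Pointing the directory at the mounted Drive folder where the models are saved shows how far training has progressed without reconnecting to the running notebook.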
I am conducting experiments with neural networks using Keras + TensorFlow backend. I do this using GPU on my PC, running Windows 7.
My workflow looks like the following.
I create a small python script that defines a model then runs model.fit_generator with ~50 epochs and early stopping, if validation accuracy does not improve after 10-15 epochs. Then I run a terminal with a command like python model_v3_4_5.py
Usually one epoch takes about 1.5 hours. During this period some new ideas (training parameters or new architecture) come into my head.
Then I create a new python script...
During experiments I've found that it is better not to train several models in parallel. I've experienced doubling of epoch time and strange decrease of validation accuracy.
Therefore, I'd like to wait until the first training finishes then run the second one. Simultaneously, I'd like to avoid idling of my PC and run a new training immediately after the first one has finished.
But I don't know exactly when the first training finishes, therefore, running commands like timeout <50 hours> && python model_v3_4_6.py would be a dumb solution.
Then I need some kind of a queue manager.
One solution that has come to my mind is installing a Jenkins slave on my PC and using the queues that Jenkins provides. As far as I remember, Jenkins has issues with GPU access.
Another variant is training models in a Jupyter notebook in separate cells. However, I cannot see the queue of cell execution there, and this is a topic that is still being discussed.
Update. Another variant: add code to the model scripts that retrieves the current GPU state (is it currently running a NN?) and waits while it is busy. This will cause issues when several scripts (more than one bright new idea :) ) are waiting for the GPU to become idle.
Are there any other variants?
In the end, I came up with a simple cmd script:
set PYTHONPATH=%CD%
:start
for %%m in (train_queue\model*.py) do (
    python %%m
    del %%m
)
timeout 5
goto start
One creates a subdirectory train_queue and puts scripts with models in it. All scripts log their output to files, whose names contain timestamps.
The script also calls the timeout program to wait 5 seconds before checking the queue again.
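The same queue loop can be sketched in Python for systems without cmd. This is an illustrative port, not part of the original answer: the function names are hypothetical, and the train_queue directory, the model*.py pattern, and the 5-second pause are taken from the script above.

```python
import glob
import os
import subprocess
import sys
import time


def run_queue_once(queue_dir="train_queue"):
    """Run every model*.py in queue_dir one at a time, deleting each afterwards."""
    # Mirror "set PYTHONPATH=%CD%" from the cmd script.
    env = dict(os.environ, PYTHONPATH=os.getcwd())
    finished = []
    for script in sorted(glob.glob(os.path.join(queue_dir, "model*.py"))):
        # subprocess.run blocks, so only one training runs at a time.
        subprocess.run([sys.executable, script], env=env)
        os.remove(script)  # like the cmd script, delete even if the run failed
        finished.append(os.path.basename(script))
    return finished


def run_queue_forever(queue_dir="train_queue", poll_seconds=5):
    """Keep polling the queue directory, pausing between passes."""
    while True:
        run_queue_once(queue_dir)
        time.sleep(poll_seconds)
```

Scripts dropped into the directory while a training is running are picked up on the next pass, so new ideas can be queued without interrupting the current run.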
I'm running an RNN demo at https://github.com/suriyadeepan/easy_seq2seq/blob/master/execute.py. Everything runs smoothly, except that I don't know when it should stop.
The train() method in this module (execute.py) doesn't seem to have a stop condition. Has anyone else run this demo? How can this method stop? Do you have to kill it yourself? If so, when?
Thanks for help.
The train() method will not stop on its own, as it contains an infinite loop. It periodically saves the model after a certain number of iterations, depending on the settings in seq2seq.ini.
You should cancel the training when you are ready (with CTRL + C). You can then run the most-recently saved model in 'test' or 'serve' mode. You can change the mode in seq2seq.ini and then run python execute.py again to run the code in that mode.