I recently started using Google Colab to train my CNN model. A single training run takes 10+ hours, but I cannot stay in the same place for that long, so I always power off my laptop and let the process keep going.
My code saves models automatically. I figured out that after I disconnect from Colab, the process is still saving models.
Here are the questions:
When I try to reconnect to the Colab notebook, it always gets stuck at the "INITIALIZING" stage and can't connect. I'm sure that the process is running. How do I know whether the process is OVER?
Is there any way to reconnect to the ongoing process? It would be nice to be able to observe the training losses during training.
Sorry for my poor English, thanks a lot.
Output your loss values to a log file saved on your Drive, and periodically check this file.
You can run your training process like this:
log_file = "/content/drive/My Drive/path/log.log"
!python train.py > "{log_file}"
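If you go this route, mount your Drive first so the path exists; you can then peek at the newest loss lines from any notebook cell. A minimal sketch (the log path is just the placeholder from above):

from google.colab import drive
drive.mount('/content/drive')  # makes /content/drive/My Drive available

log_file = "/content/drive/My Drive/path/log.log"
!tail -n 20 "{log_file}"  # show the most recent loss lines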
First question: restart the runtime from the Runtime menu.
Second question: I think you can use TensorBoard to monitor your work.
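Colab has built-in TensorBoard support. Assuming your training code writes event files somewhere (the logs directory below is an assumption), a sketch would look like:

# in the training script: write TensorBoard event files (log_dir is an assumption)
import tensorflow as tf
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs")
model.fit(x_train, y_train, callbacks=[tensorboard_cb])  # model/data: your own

# in a notebook cell: launch TensorBoard pointed at the same directory
%load_ext tensorboard
%tensorboard --logdir logs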
It seems there's no normal way to do this. But you can save your model to Google Drive with the current epoch number in the filename, so when you see something like "my_model_epoch_1000" on your Google Drive, you will know that the process is over.
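If you train with Keras, the stock ModelCheckpoint callback can produce exactly this kind of filename; a sketch, where the Drive folder is an assumption:

from tensorflow import keras

# writes e.g. my_model_epoch_1000.h5 to Drive after every epoch (folder is an assumption)
checkpoint_cb = keras.callbacks.ModelCheckpoint(
    "/content/drive/My Drive/models/my_model_epoch_{epoch}.h5"
)
model.fit(x_train, y_train, epochs=1000, callbacks=[checkpoint_cb])  # model/data: your own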
I have created a Data Quality monitoring job from the SageMaker Studio UI and also with SageMaker SDK code, following the guide on creating a model Data Quality monitoring job.
Errors:
When there is no captured data (which is expected), the monitoring job failure reason is:
Job inputs had no data
From the logs, I can see that it is using Java in the background. I'm not sure how to debug this:
org.json4s.package$MappingException: Do not know how to convert
JObject(List(0,JDouble(38.0))) into class java.lang.String.
Once we create the Data Quality monitoring job using the SageMaker Studio UI or the SageMaker Python SDK, it takes an hour to start. Is there a way to debug the monitoring job without waiting an hour every time we get an error?
For development, it might be easier to trigger execution of the monitoring job manually. Take a look at this Python code.
If you want to see how it's used, open the lab 5 notebook of the workshop and scroll almost to the end, to the cells right after the "Triggering execution manually" title.
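To inspect failures without waiting in the Studio UI, the boto3 SageMaker client can also list the executions behind a monitoring schedule and surface their failure reasons; a sketch, where the schedule name is an assumption:

import boto3

sm = boto3.client("sagemaker")

# fetch the most recent executions of the schedule (name is an assumption)
executions = sm.list_monitoring_executions(
    MonitoringScheduleName="my-data-quality-schedule",
    SortBy="ScheduledTime",
    SortOrder="Descending",
    MaxResults=5,
)

# each summary carries the status and, on failure, the reason
for summary in executions["MonitoringExecutionSummaries"]:
    print(summary["MonitoringExecutionStatus"], summary.get("FailureReason"))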
I am using os.system() to execute the first command mentioned here. It essentially trains a fastText supervised model with autotune enabled. The important thing is that the program flow continues to the next line before training completes and before the model file is saved. Both the training and the saving of the model file happen with this single command.
So, basically, I want to prevent the program from moving to the next line until the command finishes. How can I achieve this?
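Note that os.system() itself blocks until the shell returns, so if control reaches the next line early, the command is most likely putting itself in the background. Either way, subprocess.run() gives you an explicit wait plus error checking; a sketch, where the fastText flags are an assumption about the command being run:

import subprocess

# blocks until fasttext finishes training and has written the model file
subprocess.run(
    "fasttext supervised -input train.txt -output model -autotune-validation valid.txt",
    shell=True,
    check=True,  # raise CalledProcessError if the command fails
)

# only reached after the command above has returned, so the model file exists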
How come after 2 hours of running a model, I get a popup window saying:
Runtime disconnected
The connection to the runtime has timed out.
CLOSE RECONNECT
I had restarted my runtime and thought I had 12 hours to train a model. Any idea how to avoid this? My other question: is it possible to find out the time left before the runtime gets disconnected, using a TF or Python API?
The runtime gets disconnected when the notebook stays idle for more than 90 minutes. This is an unofficial number, as Google Colab has released nothing official about it; the FAQ answers the question rather cheekily. An extract from the official Colab FAQ:
Where is my code executed? What happens to my execution state if I close the browser window?
Code is executed in a virtual machine dedicated to your account. Virtual machines are recycled when idle for a while, and have a maximum lifetime enforced by the system.
So to avoid this, keep your browser open and don't let your system sleep for a time greater than 90 minutes.
This also means that if you close your browser and reopen the notebook within 90 minutes, you will still have all your running processes and session variables intact!
Also, note that currently you can run a notebook for a maximum of 12 hours. (in the "non-idle" state of course).
To answer your second question: this idle-state behavior is a Colab thing, so I don't think TF or Python expose any API for it.
So it is good practice to save your models to a folder periodically. This way, in the unfortunate event of your runtime getting disconnected, your work will not be lost, and you can simply restart your training from the latest saved model!
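Resuming from the newest checkpoint is then straightforward; a minimal sketch, assuming Keras models saved as .h5 files on Drive (the folder and pattern are assumptions):

import glob
import os
from tensorflow import keras

# pick the most recently written checkpoint on Drive, if any exist
checkpoints = glob.glob("/content/drive/My Drive/models/*.h5")
if checkpoints:
    latest = max(checkpoints, key=os.path.getmtime)  # newest file by modification time
    model = keras.models.load_model(latest)  # resume training from here
else:
    model = build_model()  # hypothetical: your own model-construction function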
PS: I got the 90-minute figure from an experiment done by a fellow user.
I am training a neural network for Neural Machine Translation on Google Colaboratory. I know that the limit before disconnection is 12 hours, but I am frequently disconnected earlier (after 4 or 6 hours). The amount of time required for the training is more than 12 hours, so I save checkpoints every 5000 epochs.
What I don't understand is whether, when I am disconnected from the runtime (a GPU is used), the code is still executed by Google on the VM. I ask because then I could simply save the intermediate models to Drive and continue the training even if I am disconnected.
Does anyone know?
Yes, for ~1.5 hours after you close the browser window.
To keep things running longer, you'll need an active tab.
I am conducting experiments with neural networks using Keras with the TensorFlow backend. I do this using the GPU on my PC, which runs Windows 7.
My workflow looks like the following.
I create a small python script that defines a model then runs model.fit_generator with ~50 epochs and early stopping, if validation accuracy does not improve after 10-15 epochs. Then I run a terminal with a command like python model_v3_4_5.py
Usually one epoch takes about 1.5 hours. During this period some new ideas (training parameters or new architecture) come into my head.
Then I create a new python script...
During my experiments I've found that it is better not to train several models in parallel: I've experienced a doubling of epoch time and a strange decrease in validation accuracy.
Therefore, I'd like to wait until the first training finishes and then run the second one, while avoiding idle time on my PC: a new training should start immediately after the first one has finished.
But I don't know exactly when the first training finishes, therefore, running commands like timeout <50 hours> && python model_v3_4_6.py would be a dumb solution.
Then I need some kind of a queue manager.
One solution that has come to my mind is installing a Jenkins slave on my PC and using the queues that Jenkins provides. As far as I remember, though, Jenkins has issues with GPU access.
Another variant is training the models in a Jupyter notebook in separate cells. However, I cannot see a queue of cell executions there, and this is a topic still being discussed.
Update. Another variant: add some code to the model scripts that checks the current GPU state (is it running a NN right now?) and waits while it is busy. This will cause issues when several scripts (more than one bright new idea :) ) are waiting for the GPU to become idle.
Are there any other variants?
Finally, I've come up with a simple cmd script:
rem Run every queued model script once, delete it, then poll the queue again
set PYTHONPATH=%CD%
:start
for %%m in (train_queue\model*.py) do (
    python %%m
    del %%m
)
rem wait 5 seconds before the next pass over the queue
timeout 5
goto start
One creates a subdirectory train_queue and puts model scripts into it. All scripts log their output to files whose names contain timestamps.
The script also calls the timeout program to pause for 5 seconds between passes over the queue.
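A rough Python equivalent of the same polling loop, in case you prefer to stay in one language (the train_queue directory and model*.py pattern follow the cmd script above; everything else is a sketch):

import glob
import os
import subprocess
import time

QUEUE_DIR = "train_queue"  # same queue directory as the cmd script

while True:
    # run any queued model scripts, oldest name first
    for script in sorted(glob.glob(os.path.join(QUEUE_DIR, "model*.py"))):
        subprocess.run(["python", script])  # blocks until this training run finishes
        os.remove(script)  # dequeue the script once it has run
    time.sleep(5)  # poll the queue every 5 seconds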