I am training a neural network for Neural Machine Traslation on Google Colaboratory. I know that the limit before disconnection is 12 hrs, but I am frequently disconnected before (4 or 6 hrs). The amount of time required for the training is more then 12 hrs, so I add some savings each 5000 epochs.
I don't understand if when I am disconnected from Runtime (GPU is used) the code is still execute by Google on the VM? I ask it because I can easily save the intermediate models on Drive, and so continue the train also if I am disconnected.
Does anyone know it?
Yes, for ~1.5 hours after you close the browser window.
To keep things running longer, you'll need an active tab.
Related
This message keeps popping out after I used two GPUs simultaneously for two notebooks from the same account for about half an hour (Colab wasn't running for 12 hours):
Photo of pop-out message
You cannot currently connect to a GPU due to usage limits in Colab.
It has been about two hours since I last used colab, but the message still pops up. It would be really great if I know for how long does this lasts. I think it could be 12 hours but would want to know from someone who has experienced this. In my case, the GPUs are available, but due to recent excess computing and running one cell for long, I have reached my usage limit for gpus. However, I want to know that after how much waiting will colab let me use its GPUs again.
Update: It has been more than 2 days and colab still doesn't allow me to use GPUs. The usage limit message still pops out.
GPU allocation per user is restricted to maximum 12 hours at a time. The next time you can use it will probably be after 12 hours or once a user has given up GPU ability.
You may want to check Google Colab Pro which has some advantages over the non-paid version.
The usage limit is pretty dynamic and depends on how much/long you use colab. I was able to use the GPUs after 5 days; however, my account again reached usage limit right after 30mins of using the GPUs (google must have decreased it further for my account). The situation really became normal after months of not using colab from that account. My suggestion is to have multiple google accounts for colab, so you could use the other accounts when facing usage limits. Sorry for not replying to the comments.
How come after 2 hours of running a model, I get a popup window saying:
Runtime disconnected
The connection to the runtime has timed out.
CLOSE RECONNECT
I had restarted my runtime and thought I have 12 hours to train a model. Any Idea how to avoid this? My other question: Is it possible to find out the time left for runtime to get disconnected using a TF or Python API?
Runtime gets disconnected when the notebook goes to "idle" mode for a time greater than 90 minutes. This is an unofficial number, as google colab has no official release about this. This is how google colab gets away with it by answering cheekily:
An extract from the Official Colab FAQ
Where is my code executed? What happens to my execution state if I close the browser window?
Code is executed in a virtual machine dedicated to your account.
Virtual machines are recycled when idle for a while, and have a
maximum lifetime enforced by the system.
So to avoid this, keep your browser open and don't let your system sleep for a time greater than 90 minutes.
This also means if you happen to close your browser within the 90 minutes, then if you reopen the notebook within 90 minutes you will still have all your running processes and session variables intact!
Also, note that currently you can run a notebook for a maximum of 12 hours. (in the "non-idle" state of course).
To answer your second question, this "idle state" stuff is a colab thing. So I don't think TF or Python will have anything to do with it.
So it is good practise to have your models saved into a folder periodically. This way, in the unfortunate event of your runtime getting disconnected, your work will not be lost. And you can simply restart your training from the latest saved model!
PS: I got the number 90 minutes from an experiment done by a fellow user
I run CNN code on Colab notebook, and it takes long time. However, the connection always breaks after I ran 2 or 3 hours and cannot be reconnect back. I was told Colab virture machine would break connection after 12 hours without any operates, so how can I avoid restart my code after connection broken, or any easier way to controll Colab?
Colab restarts on inactivity in 2-3 hours, you need to reconnect it. To simply avoid this show some activity on runtime.
I recently started to use Google Colab to train my CNN model. It always needs about 10+ hours to train once. But I cannot stay in the same place during these 10+ hours, so I always poweroff my notebook and let the process keep going.
My code will save models automatically. I figured out that when I disconnect from the Colab, the process are still saving models after disconnection.
Here are the questions:
When I try to reconnect to the Colab notebook, it always stuck at "INITIALIZAING" stage and can't connect. I'm sure that the process is running. How do I know if the process is OVER?
Is there any way to reconnect to the ongoing process? It will be nice to me to observe the training losses during the training.
Sorry for my poor English, thanks alot.
Output your loss results to a log file saved in your drive, and periodically check this file.
You can run your training process like:
!log_file = "/content/drive/My Drive/path/log.log"
!python train.py > "${log_file}"
first question: restart runtime from runtime menu
second question: i think you can use tensorboard to monitor your work.
It seems there's no normal way to do this. But you can save your model to Google Drive with current training epoch number, so when you see something like "my_model_epoch_1000" on your google drive, you will know that the process is over.
I am conducting experiments with neural networks using Keras + TensorFlow backend. I do this using GPU on my PC, running Windows 7.
My workflow looks like the following.
I create a small python script that defines a model then runs model.fit_generator with ~50 epochs and early stopping, if validation accuracy does not improve after 10-15 epochs. Then I run a terminal with a command like python model_v3_4_5.py
Usually one epoch takes about 1.5 hours. During this period some new ideas (training parameters or new architecture) come into my head.
Then I create a new python script...
During experiments I've found that it is better not to train several models in parallel. I've experienced doubling of epoch time and strange decrease of validation accuracy.
Therefore, I'd like to wait until the first training finishes then run the second one. Simultaneously, I'd like to avoid idling of my PC and run a new training immediately after the first one has finished.
But I don't know exactly when the first training finishes, therefore, running commands like timeout <50 hours> && python model_v3_4_6.py would be a dumb solution.
Then I need some kind of a queue manager.
One solution that have come to my mind is installing Jenkins slave on my PC and use queues that Jenkins provides. As far as I remember, Jenkins has issues with GPU access.
Another variant - training models in the Jupyter notebook in separate cells. However, I cannot see queue of cell execution here. And this is a topic, being discussed.
Update. Next variant. Add to the model scripts some code, retrieving current GPU state (does it run NN currently?) and wait if it is calculating. This will produce issues in case of several scripts (more than one bright new idea :) ) waiting for GPU to idle.
Are there any other variants?
Finally, I've come up to the simple cmd script
set PYTHONPATH=%CD%
:start
for %%m in (train_queue\model*.py) do (
python %%m
del %%m
)
timeout 5
goto start
One creates a subdirectory train_queue and puts scripts with models in it. All scripts log their output to files, whose names contain timestamps.
This script also calls timeout program