tcmalloc: large alloc in Python on Google Colab

I was trying to apply a deep learning algorithm (CNN) in Python, but after the step where I separate the training and testing data and transform the time series into images, my Colab notebook crashed and restarted itself.
It gives an error like "Your session crashed after using all available RAM", and when I checked app.log I saw something about tcmalloc: large alloc. I couldn't find anything to fix this crash.
Do you have any idea how to prevent this warning and fix this situation?

Your session ran out of all available RAM. You can purchase Colab Pro to get extra RAM, or you can use a higher-RAM machine and train the neural network there.
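If upgrading the machine is not an option, another approach is to avoid building the whole image-transformed dataset in RAM at once, for example by writing it to a disk-backed array and using a smaller dtype. A minimal sketch, where the sizes, file name, and toy transform are placeholders rather than anything from the question:
import numpy as np

n_samples, img_h, img_w = 10000, 64, 64   # hypothetical dataset size and image shape

# store the transformed images in a disk-backed .npy file instead of holding them all in RAM
images = np.lib.format.open_memmap('images.npy', mode='w+',
                                   dtype=np.float32, shape=(n_samples, img_h, img_w))

for i in range(n_samples):
    series = np.random.rand(img_w)                      # stand-in for one real time series
    img = np.outer(series, series).astype(np.float32)   # toy stand-in for the time-series-to-image transform
    images[i] = img                                     # written to disk, not kept in RAM

images.flush()
A disk-backed array like this can then be fed to training in batches rather than as one large in-memory block, and float32 uses half the memory of the default float64.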

Related

How long does Colab's usage limit last?

This message keeps popping up after I used two GPUs simultaneously for two notebooks from the same account for about half an hour (Colab wasn't running for 12 hours):
Photo of the pop-up message
You cannot currently connect to a GPU due to usage limits in Colab.
It has been about two hours since I last used Colab, but the message still pops up. It would be really helpful to know how long this lasts. I think it could be 12 hours, but I would like to hear from someone who has experienced it. In my case the GPUs are available, but because of recent heavy computing and running one cell for a long time, I have reached my GPU usage limit. I want to know how long I have to wait before Colab lets me use its GPUs again.
Update: It has been more than 2 days and Colab still doesn't allow me to use GPUs. The usage limit message still pops up.
GPU allocation per user is restricted to a maximum of 12 hours at a time. The next time you can use it will probably be after 12 hours, or once another user has given up their GPU.
You may want to check Google Colab Pro, which has some advantages over the free version.
The usage limit is fairly dynamic and depends on how much and how long you use Colab. I was able to use the GPUs after 5 days; however, my account reached the usage limit again after just 30 minutes of GPU use (Google must have decreased it further for my account). The situation only really returned to normal after months of not using Colab from that account. My suggestion is to have multiple Google accounts for Colab, so you can switch to another account when you hit usage limits. Sorry for not replying to the comments.

TensorFlow Colab: Runtime disconnected ("The connection to the runtime has timed out")

How come after 2 hours of running a model, I get a popup window saying:
Runtime disconnected
The connection to the runtime has timed out.
CLOSE RECONNECT
I had restarted my runtime and thought I had 12 hours to train a model. Any idea how to avoid this? My other question: is it possible to find out the time left before the runtime gets disconnected, using a TF or Python API?
The runtime gets disconnected when the notebook stays in "idle" mode for longer than 90 minutes. This is an unofficial number, as Google Colab has made no official statement about it. This is how Google Colab gets away with it, answering somewhat cheekily:
An extract from the Official Colab FAQ
Where is my code executed? What happens to my execution state if I close the browser window?
Code is executed in a virtual machine dedicated to your account. Virtual machines are recycled when idle for a while, and have a maximum lifetime enforced by the system.
So to avoid this, keep your browser open and don't let your system sleep for more than 90 minutes.
This also means that if you close your browser and reopen the notebook within 90 minutes, you will still have all your running processes and session variables intact!
Also, note that currently you can run a notebook for a maximum of 12 hours (in the "non-idle" state, of course).
To answer your second question: this "idle state" behaviour is a Colab thing, so I don't think TF or Python has anything to do with it.
So it is good practice to save your models to a folder periodically, as sketched below. That way, in the unfortunate event of your runtime getting disconnected, your work will not be lost, and you can simply restart your training from the latest saved model!
PS: I got the 90-minute figure from an experiment done by a fellow user.
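Here is one way the periodic saving could look with a Keras model in Colab; a minimal sketch, where the Drive folder and the model are placeholders, not something from the question:
from google.colab import drive
from tensorflow import keras

drive.mount('/content/drive')   # make Google Drive available at /content/drive

# write a checkpoint to Drive after every epoch (hypothetical folder and file name)
ckpt = keras.callbacks.ModelCheckpoint(
    '/content/drive/My Drive/checkpoints/model_epoch_{epoch:02d}.h5',
    save_freq='epoch')

# model.fit(x_train, y_train, epochs=50, callbacks=[ckpt])   # assumes an existing compiled model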

Does Google Colab continue running the script after "Runtime disconnected"?

I am training a neural network for Neural Machine Translation on Google Colaboratory. I know that the limit before disconnection is 12 hours, but I am frequently disconnected earlier (after 4 or 6 hours). The time required for training is more than 12 hours, so I save checkpoints every 5000 epochs.
What I don't understand is: when I am disconnected from the runtime (a GPU is used), is the code still executed by Google on the VM? I ask because then I could simply save the intermediate models to Drive and continue training even when disconnected.
Does anyone know?
Yes, for ~1.5 hours after you close the browser window.
To keep things running longer, you'll need an active tab.
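Given that window, a workable pattern is to keep writing checkpoints to Drive during training and, after a disconnect, resume from the newest one. A minimal sketch, assuming Drive is already mounted at /content/drive and using a hypothetical checkpoint folder and file pattern:
import glob, os

ckpt_dir = '/content/drive/My Drive/nmt_checkpoints'            # hypothetical folder on Drive
ckpts = glob.glob(os.path.join(ckpt_dir, 'model_step_*.h5'))    # checkpoints written during training

if ckpts:
    latest = max(ckpts, key=os.path.getmtime)   # pick the most recently written checkpoint
    print('Resuming from', latest)              # load it with your framework's load function
else:
    print('No checkpoint found, starting training from scratch')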

Python killed on GCP

I have been working on a comparison between running deep learning code on my local machine and on Google Cloud Platform.
The code is a recurrent neural network, and it ran perfectly well on my local machine.
But in the GCP Cloud Shell, when I try to run my Python file, it shows "Killed":
userID#projectID:~$ python rnn.py
Killed
Is it because I am out of memory? (I tried running it line by line, and the second time I assigned large data to a variable, it got stuck.)
My code is somewhat like this:
import numpy as np
imdb = np.load('imdb_word_emb.npz')   # load the saved word-embedding arrays
X_train = imdb['X_train']             # (25000, 80, 128) training array
X_test = imdb['X_test']               # (25000, 80, 128) test array
The machine gets stuck at the X_test assignment and shows "Killed". I tried swapping the order of the X_train and X_test lines, but it still got stuck at the same place: whichever of the two arrays is loaded second.
My training data is a (25000, 80, 128) array, and so is my testing data. The data set works perfectly well on my local machine, so I am sure there is no problem with the data itself.
Or is there some other reason?
It would be awesome if anyone who knows how to solve this, or even just a few keywords to search for, could point me in the right direction. Thank you :D
The error you are getting is because Cloud Shell is not intended for computational or network-intensive processes; see the Cloud Shell limitations.
I understand you want to compare your local machine with Google Cloud Platform. As stated in the public docs:
"When you start Cloud Shell, it provisions a g1-small Google Compute Engine"
A g1-small machine type has 1.70 GB of RAM and a shared physical core. Keeping this in mind, and that Cloud Shell is limited as stated above, your local machine is likely more powerful than Cloud Shell, so you would not see any improvement.
I recommend you create a Compute Engine instance with a different machine type. You can use a custom machine type to set the number of cores and the amount of RAM you want. I assume you want to benefit from running your workload faster on Google Compute Engine, so choose a machine type with more resources than your local machine and compare how much it improves.
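For a sense of scale, the arrays described in the question already exceed the g1-small's 1.70 GB by themselves; a quick back-of-the-envelope check (the float32 dtype is an assumption, since the question doesn't state it; float64 would double these numbers):
# one (25000, 80, 128) array, assuming 4-byte float32 elements
elements = 25000 * 80 * 128           # 256,000,000 values per array
gb_per_array = elements * 4 / 1e9     # roughly 1.02 GB per array
print(gb_per_array)                   # two such arrays need about 2 GB, more than the 1.70 GB available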

How to reconnect to an ongoing process on Google Colab

I recently started using Google Colab to train my CNN model. It always needs about 10+ hours to train once, but I cannot stay in the same place for those 10+ hours, so I always power off my notebook and let the process keep going.
My code saves models automatically. I figured out that when I disconnect from Colab, the process is still saving models after the disconnection.
Here are my questions:
When I try to reconnect to the Colab notebook, it always gets stuck at the "INITIALIZING" stage and can't connect. I'm sure the process is still running. How do I know if the process is OVER?
Is there any way to reconnect to the ongoing process? It would be nice to be able to observe the training losses during training.
Sorry for my poor English, thanks a lot.
Output your loss results to a log file saved in your Drive, and check this file periodically.
You can run your training process like this in a Colab cell (the first line is a plain Python assignment, and {log_file} is interpolated into the shell command):
log_file = "/content/drive/My Drive/path/log.log"
!python train.py > "{log_file}"
First question: restart the runtime from the Runtime menu.
Second question: I think you can use TensorBoard to monitor your work.
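If the training script writes TensorBoard event files, Colab can show TensorBoard inline via its notebook magics; the log directory below is a placeholder:
%load_ext tensorboard
%tensorboard --logdir "/content/drive/My Drive/path/logs"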
It seems there is no clean way to do this, but you can save your model to Google Drive with the current epoch number in the file name; when you see something like "my_model_epoch_1000" in your Google Drive, you will know that the process is over.
