Has anyone figured out a way to keep files persistent across sessions in Google's newly open-sourced Colaboratory?
Using the sample notebooks, I'm successfully authenticating and transferring CSV files from my Google Drive instance, and I have stashed them in /tmp, my ~, and ~/datalab. Pandas can read them just fine off of disk too. But once the session times out, it looks like the whole filesystem is wiped and a new VM is spun up, without my downloaded files.
I guess this isn't surprising given Google's Colaboratory FAQ:
Q: Where is my code executed? What happens to my execution state if I close the browser window?
A: Code is executed in a virtual machine dedicated to your account. Virtual machines are recycled when idle for a while, and have a maximum lifetime enforced by the system.
Given that, maybe this is a feature (i.e. "go use Google Cloud Storage, which works fine in Colaboratory")? When I first used the tool, I was hoping that any .csv files in the My Drive/Colab Notebooks Google Drive folder would also be loaded onto the VM instance that the notebook was running on :/
Put this before your code, so it will always download your file before your code runs:
!wget -q http://www.yoursite.com/file.csv
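For example (the URL and filename above are just placeholders), you can then read the file straight off the local disk:

import pandas as pd

df = pd.read_csv('file.csv')   # the file wget fetched into the working directory
df.head()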
Your interpretation is correct. VMs are ephemeral and recycled after periods of inactivity. There's no mechanism for persistent data on the VM itself right now.
In order for data to persist, you'll need to store it somewhere outside of the VM, e.g., Drive, GCS, or any other cloud hosting provider.
Some recipes for loading and saving data from external sources are available in the I/O example notebook.
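For instance, a minimal sketch of copying a file out to a GCS bucket before the VM is recycled might look like this; the bucket name and paths are placeholders:

from google.colab import auth
auth.authenticate_user()   # grant this VM access to your GCP project

# copy a local file to a bucket you own; 'my-bucket' is an assumed name
!gsutil cp /tmp/my_data.csv gs://my-bucket/colab-backups/my_data.csv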
Not sure whether this is the best solution, but you can sync your data between Colab and Drive with automated authentication like this: https://gist.github.com/rdinse/159f5d77f13d03e0183cb8f7154b170a
Include this for files in your Google Drive:
from google.colab import drive
drive.mount('/content/drive')
After it runs, you will see the drive mounted in your Files tab, and you can access your files with a path like:
'/content/drive/MyDrive/<your folder inside drive>/file.ext'
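For example, assuming a file named data.csv inside a Drive folder called my_project:

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/my_project/data.csv')
df.head()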
Clouderizer may provide some data persistence, at the cost of a long setup (because you use Google Colab only as a host) and little space to work with.
But in my opinion, that's better than having your file(s) "recycled" when you forget to save your progress.
As you pointed out, Google Colaboratory's file system is ephemeral. There are workarounds, though there's a network latency penalty and code overhead: e.g. you can use boilerplate code in your notebooks to mount external file systems like GDrive (see their example notebook).
Alternatively, while this is not supported in Colaboratory, other Jupyter hosting services – like Jupyo – provision dedicated VMs with persistent file systems so the data and the notebooks persist across sessions.
If anyone's interested in saving and restoring the whole session, here's a snippet I'm using that you might find useful:
import os
import dill
from google.colab import drive

backup_dir = 'drive/My Drive/colab_sessions'
backup_file = 'notebook_env.db'
backup_path = backup_dir + '/' + backup_file

def init_drive():
  # mount Drive and create the backup directory if it doesn't exist
  drive.mount('drive')
  if not os.path.exists(backup_dir):
    os.makedirs(backup_dir)

def restart_kernel():
  # hard-exit the process so Colab restarts the kernel
  os._exit(0)

def save_session():
  # pickle the whole interpreter state to Drive
  init_drive()
  dill.dump_session(backup_path)

def load_session():
  # restore the interpreter state from Drive
  init_drive()
  dill.load_session(backup_path)
Edit: This works fine as long as your session size is not too big. You'll need to check whether it works for you.
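In case it helps, typical usage is just:

save_session()   # at the end of a work session; writes notebook_env.db to Drive
# ...later, in a fresh VM...
load_session()   # restores the variables that dill pickled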
I was interested in importing a module in a separate .py file.
What I ended up doing is copying the .py file contents to the first cell in my notebook, adding the following text as the first line:
%%writefile mymodule.py
This creates a separate file named mymodule.py in the working directory so your notebook can use it with an import line.
I know that simply running all of the module's code in the notebook would also make its variables and functions available, but my code required importing it as a module, so this was good enough for me.
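Putting it together, a minimal sketch (with hypothetical module contents) is a first cell like:

%%writefile mymodule.py
# whatever was in the original .py file, e.g.:
def greet(name):
    return 'Hello, ' + name

and then a later cell that uses it:

import mymodule
print(mymodule.greet('Colab'))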
Related
I am using this code snippet to download files from google colab:
from google.colab import files
files.download("sample_data/california_housing_train.csv")
print("done")
But the output is
done
Downloading "california_housing_train.csv"
The download function is asynchronous. This may not seem like much, but the function is being run in a loop, so right after a download starts, some external libraries are called at the beginning of the next iteration and they clear the output. I have also tested many times that if the output is cleared before the file has finished downloading, the file will not download.
Also, please don't suggest downloading it from the file menu. There are going to be a lot of files, so it has to be done programmatically. And please don't suggest zipping everything programmatically and then downloading it from the file browser, because I have to download the files after each iteration: I am using a workaround for Colab's "are you still there" prompt, but it still considers my session idle, deletes the runtime, and all the files are lost.
Thank you in advance. Sorry for any grammatical errors.
Edit:
I have also tried using sleep(), but that didn't work, as sometimes the files take much longer to download than their size would suggest, despite a very good internet connection. Keeping very high sleep() values isn't great either, since it's not the best way to do this; still, if I find nothing else, I will fall back to sleep().
OK, so I found a temporary workaround which is not fully programmatic and may need manual downloading in cases where the runtime gets deleted due to inactivity before the whole program has executed.
My idea is to mount Google Drive on the runtime and copy the files to a folder in Google Drive after each iteration. At the end, zip the folder in Drive and download it with files.download(path to drive zip). If the runtime is not able to execute the whole program and dies prematurely, the folder in Google Drive will remain (unlike local files, which are deleted when the runtime is destroyed), so we can go and manually download the folder from Google Drive.
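Roughly, the idea looks like this (the folder and file names here are placeholders, not anything the question specifies):

import os
import shutil
from google.colab import drive, files

drive.mount('/content/drive')
backup_dir = '/content/drive/MyDrive/run_outputs'   # assumed folder name in Drive
os.makedirs(backup_dir, exist_ok=True)

# after each iteration: copy the freshly produced file into Drive
shutil.copy('result.csv', backup_dir)               # 'result.csv' is a placeholder

# at the very end: zip the Drive folder and trigger a browser download
shutil.make_archive('/content/run_outputs', 'zip', backup_dir)
files.download('/content/run_outputs.zip')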
If anyone else has a better solution, please let me know.
I have a Google Colab Notebook that is using psycopg2 to connect with a free Heroku PostgreSQL instance. I'd like to share the notebook with some colleagues for educational purposes to view and run the code.
There is nothing sensitive related to the account/database, but I would still like to hide the credentials used to make the initial connection without restricting their access.
My workaround was to create a Python module containing a function that performs the initial connection with the credentials. I converted the module into a binary .pyc, uploaded it to Google Drive, downloaded the binary into the notebook's contents via a shell command, and then used it as an import.
It obviously isn't secure but provides the obfuscation layer I was looking for.
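For anyone curious, the outline looks roughly like this; the module name, helper function, and Drive file ID are all hypothetical, and the .pyc must be compiled with the same Python minor version as the Colab runtime:

# locally, before uploading to Drive:
#   python -m py_compile db_creds.py
#   mv __pycache__/db_creds.cpython-310.pyc db_creds.pyc

# in the notebook: fetch the compiled module (here via gdown, assuming the
# file is shared by link), then import it like any other module
!gdown https://drive.google.com/uc?id=YOUR_FILE_ID -O db_creds.pyc

import db_creds                       # Python imports "sourceless" .pyc modules just fine
conn = db_creds.get_connection()      # hypothetical helper wrapping psycopg2.connect(...)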
I'm using Google Colab for a TensorFlow project, but whenever I terminate the session, all the files and work I've done get wiped out; all that's left is the .ipynb file I was using, and I have to redo everything from the beginning.
I lose all the files I'm working with and have to re-upload them the next time I open my .ipynb file. How can I solve this problem? Should I push this entire file structure to a git repo and clone it the next time I'm using it, or is there another way to do it?
Hi amanpreet! Yes, you can put all your files on GitHub, or you can put all the files in your Google Drive and access them by mounting Drive in Colab. Attaching an article for your reference:
https://buomsoo-kim.github.io/colab/2020/05/09/Colab-mounting-google-drive.md/
I am working on a Google Colab notebook that requires the user to mount Google Drive using the colab.drive Python library. They then input relative paths on the local directory tree (/content/drive/... by default on that mount) to files of interest for analysis. Now, I want to use a Google Sheet they can create as a configuration file. There is lots of info on how to authenticate gspread and fetch a sheet from its HTTPS URL, but I can't find any info on how to use gspread to access the .gsheet file that is already mounted on the local filesystem of the Colab runtime.
There are many tutorials using this flow: https://colab.research.google.com/notebooks/io.ipynb#scrollTo=yjrZQUrt6kKj, but I don't want to make the user authenticate twice (having already done so for the initial mount), and I don't want to make them input some files as relative paths and others as HTTPS URLs.
I had thought this would be much like using gspread to work with Google Sheets on my locally mounted drive, but I haven't seen that workflow anywhere either. Any pointers in that direction would help me out as well.
Thank you!
Instead of accessing the .gsheet on Colab's mounted drive, you can try storing it in the user's Drive and fetching it from there when needed. That way, as long as that kernel is running, you won't have to re-authenticate the user.
I'm also not finding any way to authenticate into Colab from another device, so you may have to modify your flow a bit.
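If a second (one-time) prompt is acceptable, the usual pattern for fetching a sheet by name with gspread looks roughly like this; the sheet name is a placeholder:

from google.colab import auth
auth.authenticate_user()             # one interactive prompt per session

import gspread
from google.auth import default
creds, _ = default()                 # reuse the credentials Colab just obtained
gc = gspread.authorize(creds)

sheet = gc.open('my_config_sheet')   # open the configuration sheet by title
config = sheet.sheet1.get_all_records()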
Any ideas how I can automatically send some files (mainly TensorFlow models) after training in Google AI Platform to another compute instance or to my local machine? I would like to run in my trainer, for instance, something like os.system('scp -r ./file1 user@host:/path/to/folder'). Of course, I don't need to use scp; it's just an example. Is there such a possibility in Google? There is no problem transferring files from the job to Google Cloud Storage, like this: os.system('gsutil cp ./example_file gs://my_bucket/path/'). However, when I try for example os.system('gcloud compute scp ./example_file my_instance:/path/') to transfer data from my AI Platform job to another instance, I get "Your platform does not support SSH". Any ideas how I can do this?
UPDATE
Maybe there is a possibility to automatically download all the files from a chosen folder in Google Cloud Storage? I would, for instance, upload data from my job instance to a Google Cloud Storage folder, and my other instance would automatically detect changes and download all the new files.
UPDATE2
I found gsutil rsync, but I am not sure whether it can be kept running constantly in the background. At this point, the only solution that comes to mind is to use a cron job on the receiving instance and run gsutil rsync, for example, every 10 minutes. But that doesn't seem like an optimal solution. Maybe there is a built-in tool or another, better idea?
The gsutil rsync command makes the contents under the destination the same as the contents under the source, by copying any missing files/objects (or those whose data has changed) and, if the -d option is specified, deleting any extra files/objects. The source must specify a directory, bucket, or bucket subdirectory. But the command does not run in the background.
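If you do end up polling, a crude sketch on the receiving instance might look like this; the bucket path, local directory, and interval are assumptions:

import subprocess
import time

while True:
    # mirror the bucket folder into a local directory; -m parallelizes, -r recurses
    subprocess.run(['gsutil', '-m', 'rsync', '-r', 'gs://my_bucket/outputs', '/home/user/outputs'])
    time.sleep(600)  # re-check every 10 minutes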
Remember that the notebook you're using is in fact a VM running JupyterLab. Based on that, you could run the rsync command once TensorFlow has finished creating the files and sync them with a directory on another instance, with something like:
import os
os.system("rsync -avrz Tensorflow/outputs/filename root@ip:Tensorflow/outputs/file")
I suggest you take a look at the rsync documentation to see all the options available for that command.