google colab - Download files synchronously - python

I am using this code snippet to download files from google colab:
from google.colab import files
files.download("sample_data/california_housing_train.csv")
print("done")
But the output is
done
Downloading "california_housing_train.csv"
The download function is asynchronous. That may not seem like a big deal, but the function runs inside a loop: after a download starts, some external libraries are called at the beginning of the next iteration and they clear the output. I have also tested this many times: if the output is cleared before the file has finished downloading, the download never completes.
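For reference, here is a minimal sketch of the pattern I'm describing; the loop body, file names, and the clear_output() call are just stand-ins for my actual code:
from google.colab import files
from IPython.display import clear_output

for path in ["out_0.csv", "out_1.csv"]:  # placeholder file names
    files.download(path)   # returns immediately; the browser download runs asynchronously
    clear_output()         # stands in for the external libraries that clear the output
                           # on the next iteration, killing any unfinished download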
Please don't suggest downloading from the file menu; there are going to be a lot of files, so it has to be done programmatically. Please also don't suggest zipping everything programmatically and downloading the archive from the file browser: I have to download the files after each iteration because, even with a workaround for Colab's "are you still there" prompt, Colab still considers my session idle, deletes the runtime, and all the files are lost.
Thank you in advance. Sorry for any grammatical errors.
Edit:
I have also tried using sleep(), but that didn't work: sometimes the files take longer to download than their size would suggest, despite a very good internet connection. Using very large sleep() values is not a good solution either. Still, if I find nothing else, I will fall back to sleep().

OK, so I found a temporary workaround. It is not fully programmatic and may still require manual downloading if the runtime gets deleted due to inactivity before the whole program has executed.
The idea is to mount Google Drive on the runtime and copy the files to a folder in Drive after each iteration. At the end, zip the folder in Drive and download it with files.download(path to the zip in Drive). If the runtime dies before the whole program has run, the folder in Google Drive survives (unlike files on the runtime, which are lost when it is destroyed), so it can still be downloaded manually from Google Drive. A minimal sketch is below.
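A minimal sketch of that workaround, assuming a Drive folder named colab_outputs (the folder name and file paths are placeholders):
import os
import shutil
from google.colab import drive, files

drive.mount('/content/drive')

DRIVE_DIR = '/content/drive/MyDrive/colab_outputs'  # placeholder folder name
os.makedirs(DRIVE_DIR, exist_ok=True)

def save_to_drive(local_path):
    # call this after each iteration so the file survives a dead runtime
    shutil.copy(local_path, DRIVE_DIR)

def download_everything():
    # at the very end, zip the Drive folder and download the archive
    archive = shutil.make_archive('/content/outputs', 'zip', DRIVE_DIR)
    files.download(archive)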
If anyone else has a better solution, please let me know.

Related

How to download numpy array files from an online drive

I have a dataset containing hundreds of numpy array files.
I am trying to store them on an online drive so that I can run my code against this dataset remotely from a server. I cannot access the server's drive; I can only run code scripts and use the terminal. I have tried Google Drive and OneDrive and looked up how to generate direct download links from those drives, but it did not work.
In short, I need to be able to fetch those files from my Python scripts. Could anyone give some hints?
You can get the download URLs very easily from Drive. I assume you have already uploaded the files into a Drive folder; from there it is straightforward to download them from Python. First, you need a Python environment that can connect to Drive; if you don't have one yet, you can follow this guide. It walks you through installing the required libraries and credentials and running a sample script. Once you can run the sample script, only minor modifications are needed to reach your goal.
To download the files you are going to need their ids. I am assuming that you already know them, but if you don't you could retrieve them by doing a Files.list on the folder where you keep the files. To do so you can use '{ FOLDER ID }' in parents as the q parameter.
To download a file you only have to run a Files.get request with its file id; you will find the download URL in the webContentLink property. A short sketch of both calls follows below. Feel free to leave a comment if you need further clarification.
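A rough sketch of those two calls with the Drive v3 Python client; FOLDER_ID is a placeholder, and creds is whatever credentials object the quickstart guide leaves you with:
from googleapiclient.discovery import build

service = build('drive', 'v3', credentials=creds)  # creds comes from the quickstart's OAuth flow

FOLDER_ID = 'YOUR_FOLDER_ID'  # placeholder

# Files.list on the folder to retrieve the file ids
resp = service.files().list(
    q=f"'{FOLDER_ID}' in parents",
    fields="files(id, name)").execute()

# Files.get on each id to read the webContentLink download URL
for f in resp.get('files', []):
    meta = service.files().get(
        fileId=f['id'], fields='webContentLink').execute()
    print(f['name'], meta.get('webContentLink'))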

Persisting data in Google Colaboratory

Has anyone figured out a way to keep files persisted across sessions in Google's newly open sourced Colaboratory?
Using the sample notebooks, I'm successfully authenticating and transferring csv files from my Google Drive instance and have stashed them in /tmp, my ~, and ~/datalab. Pandas can read them just fine off of disk too. But once the session times out, it looks like the whole filesystem is wiped and a new VM is spun up, without the downloaded files.
I guess this isn't surprising given Google's Colaboratory Faq:
Q: Where is my code executed? What happens to my execution state if I close the browser window?
A: Code is executed in a virtual machine dedicated to your account. Virtual machines are recycled when idle for a while, and have a maximum lifetime enforced by the system.
Given that, maybe this is a feature (i.e. "go use Google Cloud Storage, which works fine in Colaboratory")? When I first used the tool, I was hoping that any .csv files in the My File/Colab Notebooks Google Drive folder would also be loaded onto the VM instance that the notebook was running on :/
Put this line before your code, so it will always download your file before your code runs:
!wget -q http://www.yoursite.com/file.csv
Your interpretation is correct. VMs are ephemeral and recycled after periods of inactivity. There's no mechanism for persistent data on the VM itself right now.
In order for data to persist, you'll need to store it somewhere outside of the VM, e.g., Drive, GCS, or any other cloud hosting provider.
Some recipes for loading and saving data from external sources are available in the I/O example notebook.
Not sure whether this is the best solution, but you can sync your data between Colab and Drive with automated authentication like this: https://gist.github.com/rdinse/159f5d77f13d03e0183cb8f7154b170a
Include this for files in your Google Drive:
from google.colab import drive
drive.mount('/content/drive')
After it runs you will see it mounted in your files tab and you can access your files with the path:
'/content/drive/MyDrive/<your folder inside drive>/file.ext'
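For example, once the mount succeeds the files behave like any local path (the folder and file names below are placeholders):
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/my_folder/data.csv')  # placeholder path
df.head()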
Clouderizer may provide some data persistence, at the cost of a long setup (because you use Google Colab only as a host) and little space to work with.
But in my opinion that's better than having your file(s) "recycled" when you forget to save your progress.
As you pointed out, Google Colaboratory's file system is ephemeral. There are workarounds, though there's a network latency penalty and code overhead: e.g. you can use boilerplate code in your notebooks to mount external file systems like GDrive (see their example notebook).
Alternatively, while this is not supported in Colaboratory, other Jupyter hosting services – like Jupyo – provision dedicated VMs with persistent file systems so the data and the notebooks persist across sessions.
If anyone's interested in saving and restoring the whole session, here's a snippet I'm using that you might find useful:
import os
import dill
from google.colab import drive

backup_dir = 'drive/My Drive/colab_sessions'
backup_file = 'notebook_env.db'
backup_path = backup_dir + '/' + backup_file

def init_drive():
    # mount Drive and create the backup directory if it does not exist
    drive.mount('drive')
    if not os.path.exists(backup_dir):
        os.makedirs(backup_dir)

def restart_kernel():
    # hard-restart the kernel so a restored session starts clean
    os._exit(00)

def save_session():
    # pickle the whole interpreter state to Drive
    init_drive()
    dill.dump_session(backup_path)

def load_session():
    # restore the interpreter state from the Drive backup
    init_drive()
    dill.load_session(backup_path)
Edit: This works fine as long as your session is not too big. You need to check whether it works for you.
I was interested in importing a module in a separate .py file.
What I ended up doing is copying the .py file contents to the first cell in my notebook, adding the following text as the first line:
%%writefile mymodule.py
This creates a separate file named mymodule.py in the working directory so your notebook can use it with an import line.
I know that simply running all of the module's code in the notebook would also make its variables and functions available there, but my code required importing a module, so this was good enough for me.
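To illustrate, the first cell might look like this (the module contents are just a stand-in), and a later cell can then import it as usual:
%%writefile mymodule.py
# everything below the magic is written to mymodule.py in the working directory
def greet(name):
    return f"Hello, {name}!"

Then, in another cell:
import mymodule
print(mymodule.greet("Colab"))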

Python: Stop watchdog reacting to partially transferred files?

I previously wrote a Python script that monitors a Windows directory and uploads any new files to a remote server offsite. The intent is to run it at all times and let users dump their files there to sync with the cloud directory.
When a file being added is large enough that it is not transferred to the local drive all at once, Watchdog "sees" it while it is still partially written and tries to upload the partial file, which fails. How can I ensure that these files are "complete" before they are uploaded? Again, I am on Windows and cannot use anything but Windows to complete this task, or I would have used inotify. Is it even possible to check the "state" of a file in this way on Windows?
It looks like there is no easy way to do this. I think you can put something in place that checks the directory's stats when the event triggers and only acts once the folder size has stayed unchanged for a given amount of time; a sketch of this idea follows after the link below:
https://github.com/gorakhargosh/watchdog/issues/184
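A rough sketch of that idea, assuming you already have the file path from the watchdog event (the poll interval and retry count are arbitrary):
import os
import time

def wait_until_stable(path, interval=2.0, retries=30):
    # return True once the file size stops changing between polls
    last_size = -1
    for _ in range(retries):
        try:
            size = os.path.getsize(path)
        except OSError:  # the file may still be locked or partially written
            size = -1
        if size == last_size and size >= 0:
            return True
        last_size = size
        time.sleep(interval)
    return False

# inside the watchdog on_created handler, something like:
# if wait_until_stable(event.src_path):
#     upload(event.src_path)   # upload() is your existing transfer function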
As a side note, I would check out Apache NiFi. I have used it with a lot of success, and it was pretty easy to get up and running:
https://nifi.apache.org/

Taking a screenshot of uploaded files using Python/Heroku

There are design files uploaded/downloaded by users on a website. For every uploaded file, I would like to show a screenshot of it so people can see an image before they download it.
They are very esoteric files, though, that need to be opened in particular design tools (I don't even have the software to open them on my local machine).
My thinking is that I could run a virtual machine that has these programs installed, programmatically open each file, take a screenshot of it, and save a thumbnail. I want to do this at the moment the user uploads the design file.
Someone told me PIL could do this, but I have investigated and can't find any documentation on how to go about it.
Has anyone ever done something like this before? What is the best approach?

How can I use the Google Documents List API with non-document files such as .jpeg and .gif?

I'm currently using gdata-python-client (Google Documents List API) to access my Google Drive from the terminal on Linux, and I have a problem listing image files: it only shows .doc, .xls, and .pdf files.
Is there a way to solve this while still using gdata-python-client? I'm hoping for a solution better than switching to the Google Drive API, which would mean restarting my project.
And if I do switch to the Google Drive API, how should I do it? Can I reuse my existing project and keep it compatible with the new API?
Please give me some advice or point me to a tutorial.
Thank you very much :)
Use the Drive API. We have a Python command line sample to get you started, and Python snippets for every API method, including files.list; a short sketch follows below.
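For instance, once the authorized Drive client from that sample is built, listing image files is a single files.list call (service is the authorized client the sample creates; the query below assumes the v3 API):
# `service` is the authorized Drive client built by the command line sample
resp = service.files().list(
    q="mimeType='image/jpeg' or mimeType='image/gif'",
    fields="files(id, name, mimeType)").execute()

for f in resp.get('files', []):
    print(f['name'], f['mimeType'])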
