Download entire directories from a Google Cloud Storage bucket using Python

I have a Google Cloud Storage bucket. I can download individual objects using the download_blob function in Python, and I can use gsutil cp or gcloud compute scp to download entire directories.
Is there a way to download an entire directory from the storage bucket as a single zip file using Python?
The approach described in "Python - download entire directory from Google Cloud Storage" requires me to download file by file.
Is there a way to download an entire directory at once?

Cloud Storage has no concept of "directories". Instead, each blob can be given a name that resembles a directory path. Downloading an entire "directory" is the same as downloading all blobs whose names share a prefix.
This means that you can use wildcards with gsutil:
gsutil -m cp gs://bucket/data/abc/* .
This copies the blobs whose names start with data/abc/ (blob names have no leading slash). The * wildcard matches a single level only, so add -r if you also want the nested "subdirectories".
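For the "single zip file from Python" part of the question, here is a minimal sketch, assuming the google-cloud-storage package is installed and credentials are configured. As far as I know Cloud Storage has no server-side "zip this prefix" operation, so the objects are still fetched one by one and the archive is built locally. The function names zip_blobs and download_prefix_as_zip are made up for illustration.

```python
import zipfile


def zip_blobs(blobs, prefix, zip_path):
    """Write every blob in `blobs` into one zip archive at `zip_path`.

    `blobs` is any iterable of objects with a `.name` attribute and a
    `.download_as_bytes()` method (google.cloud.storage.Blob has both).
    Returns the archive member names that were written.
    """
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for blob in blobs:
            if blob.name.endswith("/"):
                continue  # zero-byte "folder" placeholder, nothing to store
            member = blob.name[len(prefix):]  # path relative to the prefix
            archive.writestr(member, blob.download_as_bytes())
            written.append(member)
    return written


def download_prefix_as_zip(bucket_name, prefix, zip_path):
    """List all blobs under `prefix` and zip them locally."""
    # Imported here so zip_blobs can be used without the library installed.
    from google.cloud import storage

    client = storage.Client()
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return zip_blobs(blobs, prefix, zip_path)
```

A call might look like download_prefix_as_zip("bucket", "data/abc/", "abc.zip"), where the bucket name and prefix are placeholders taken from the gsutil example above. Each object is held in memory while it is written, so very large objects may need a streaming approach instead.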

Related

Creating a tar archive from files stored in Google Cloud Storage

I would like to create a tar archive from multiple files in GCS and store the resulting tar in GCS as well. It doesn't look like gsutil or the Python Google Cloud Storage API provides the ability to do this. Is my only option to download the files, tar them, and then upload the tar to GCS? The fileset has the potential to be quite large, so ideally I would like to accomplish this without downloading all the source files.

Upload folders inside a folder to Google Cloud using Python

I previously worked in AWS and am new to Google Cloud. In AWS there was a way to upload directories/folders to a bucket. I have done a bit of research on uploading a directory/folder to a Google Cloud bucket but couldn't find anything. Can someone help me? I would like to upload some folders (not files) inside a folder to Google Cloud using Python. How can I do that?
To achieve this, you need to upload the contents of each directory file by file and replicate your local paths in the blob names in your GCS bucket.
Note: directories don't exist in GCS. A "directory" is simply a set of objects sharing the same path prefix, presented as a directory in the UI.
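The file-by-file upload described above could be sketched like this, assuming the google-cloud-storage package and configured credentials. The helper names plan_uploads and upload_directory, and the dest_prefix parameter, are made up for illustration.

```python
import os


def plan_uploads(local_dir, dest_prefix=""):
    """Return sorted (local_path, blob_name) pairs for every file under local_dir.

    Blob names always use "/" as separator, whatever the local OS uses,
    so the local folder layout is replicated as a name prefix in the bucket.
    """
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for filename in files:
            local_path = os.path.join(root, filename)
            relative = os.path.relpath(local_path, local_dir)
            blob_name = dest_prefix + relative.replace(os.sep, "/")
            pairs.append((local_path, blob_name))
    return sorted(pairs)


def upload_directory(bucket_name, local_dir, dest_prefix=""):
    """Upload every file under local_dir, one blob per file."""
    # Imported here so plan_uploads works without the library installed.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    for local_path, blob_name in plan_uploads(local_dir, dest_prefix):
        bucket.blob(blob_name).upload_from_filename(local_path)
```

For example, upload_directory("my-bucket", "photos", "backup/") would store photos/2019/a.jpg as the blob backup/2019/a.jpg (the bucket and folder names here are placeholders). Empty local folders are not represented, since there is no file to upload for them.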

Dump files downloaded by Google Colab in a temporary location to Google Drive

I have a JSON file with over 16k image URLs, which I parse with a Python script that uses urllib.request.urlretrieve to retrieve the images. I uploaded the JSON file to Google Drive and ran the Python script in Google Colab.
The files were downloaded (I checked this with a print statement in the try block around urlretrieve), and the download took substantial time, but I cannot see where the files were stored. When I ran the same script on my local machine, it stored the files in the current folder.
As an answer to this question suggests, the files may have been downloaded to some temporary location, say, on some cloud. Is there a way to dump these temporary files to Google Drive?
(Note: I had mounted the drive in the Colab notebook, but the files still don't appear to be stored in Google Drive.)
Colab stores files in a temporary location that is new every time you run the notebook. If you want your data to persist across sessions, you need to store it in Google Drive. To do that, mount a Google Drive folder in your notebook and use it as the path. You also need to give Colab permission to access your Google Drive.
After mounting Google Drive, move the files from the Colab filesystem to Google Drive with the command:
!mv /content/filename /content/gdrive/My\ Drive/
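The same move can be done from Python for many files at once. This is a minimal sketch: the helper name move_files and the suffix list are made up, and the Drive path in the commented usage is an assumption that depends on where you mounted Drive and how Colab names the folder in your session.

```python
import os
import shutil


def move_files(src_dir, dest_dir, suffixes=(".jpg", ".jpeg", ".png")):
    """Move files whose names end with one of `suffixes` from src_dir to dest_dir.

    Returns the sorted list of file names that were moved.
    """
    os.makedirs(dest_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(src_dir)):
        source = os.path.join(src_dir, name)
        if os.path.isfile(source) and name.lower().endswith(suffixes):
            shutil.move(source, os.path.join(dest_dir, name))
            moved.append(name)
    return moved


# Usage inside a Colab notebook (google.colab is only importable there):
# from google.colab import drive
# drive.mount("/content/gdrive")
# move_files("/content", "/content/gdrive/MyDrive/images")  # hypothetical target folder
```

Alternatively, passing a path inside the mounted Drive folder as the target of urlretrieve avoids the move step altogether.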

Download files from bucket using Google Cloud Client Library for Python

I used gsutil to download files from a bucket, but I would now like to incorporate the download into a Python script to automate the process (for downloading specific days' data). The following gsutil command worked fine:
gsutil -m cp -r gs://gcp-public-data-goes-16/GLM-L2-LCFA/2019/001 C:\dloadFiles
Using the storage client I have tried:
from google.cloud import storage
client = storage.Client()
with open('C:\dloadFiles') as file_obj:
    client.download_blob_to_file(
        'gs://gcp-public-data-goes-16/GLM-L2-LCFA/2019/001', file_obj)
I get the error "DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started"
This is a publicly available bucket.
You have not set GOOGLE_APPLICATION_CREDENTIALS.
Follow the link below to set up credentials:
https://stackoverflow.com/questions/45501082/set-google-application-credentials-in-python-project-to-use-google-api
After setting up credentials, the authentication error should go away.
After authenticating with your GCP credentials, you will also need to run:
gcloud auth application-default login
This sets up Application Default Credentials for SDKs such as the Python client libraries. You will then be able to interact with GCP services from Python.
Also, your gsutil command copies a whole set of files, whereas your Python code tries to download just one. You probably want to call list_blobs first and then download the blobs to files one by one.
Also check out blob.download_to_filename, which will save you some coding. With it you can write a blob to a given file name directly, without opening the file first (blob.download_to_file, by contrast, expects an already open file object).
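The list-then-download approach could be sketched as follows, assuming the google-cloud-storage package. Because the bucket in the question is public, this sketch uses an anonymous client, which needs no credentials; for a private bucket you would use storage.Client() with credentials instead. The helper names local_path_for and download_prefix are made up for illustration.

```python
import os


def local_path_for(blob_name, prefix, dest_dir):
    """Map a blob name to a local path under dest_dir, keeping the sub-path."""
    relative = blob_name[len(prefix):].lstrip("/")
    return os.path.join(dest_dir, *relative.split("/"))


def download_prefix(bucket_name, prefix, dest_dir):
    """Download every blob whose name starts with `prefix` into dest_dir."""
    # Imported here so local_path_for works without the library installed.
    from google.cloud import storage

    # Anonymous client: only works for publicly readable buckets.
    client = storage.Client.create_anonymous_client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # zero-byte "folder" placeholder
        target = local_path_for(blob.name, prefix, dest_dir)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        blob.download_to_filename(target)
```

The equivalent of the gsutil command in the question would then be roughly download_prefix("gcp-public-data-goes-16", "GLM-L2-LCFA/2019/001/", "C:/dloadFiles"). Note that the bucket name and the object prefix are passed separately rather than as one gs:// URL.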
First, turn off public access on this bucket unless you really need it to be public. For private access, you should use a service account (for your code) or OAuth credentials.
If you are running this code on a Google Cloud compute service, credentials will be discovered automatically through Application Default Credentials (ADC).
If you are running outside of Google Cloud, change this line:
client = storage.Client()
To this:
client = storage.Client.from_service_account_json('/full/path/to/service-account.json')
This line in your code tries to open a directory, which is not correct. You need to specify a file name, not a directory name, and you need to open the file in write mode:
with open('C:\dloadFiles') as file_obj:
Change it to:
with open('c:/directory/myfilename', 'w') as file_obj:
Or, for binary (data) files:
with open('c:/directory/myfilename', 'wb') as file_obj:
I am assuming that this path is a file blob and not a "simulated" folder on GCS. If this is a folder, you will need to change it to a file (storage object blob).
gs://gcp-public-data-goes-16/GLM-L2-LCFA/2019/001

How to run Python script on files in Google Cloud

I have a bunch of files in a Google Cloud Storage bucket, including some Python scripts and text files. I want to run the Python scripts on the text files. What would be the best way to go about doing this (App Engine, Compute Engine, Jupyter)? Thanks!
I recommend using Google Cloud Functions, which can be triggered automatically each time you upload a new file to Cloud Storage so that the file gets processed. You can see the workflow for this in the Cloud Functions Storage tutorial.
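As a rough illustration of that idea, here is a minimal sketch of a storage-triggered function in the (event, context) background-function style, where the event dictionary carries the bucket and object name. The names count_words and handle_upload and the word-count "processing" are placeholders for whatever your own scripts do, and the sketch assumes the google-cloud-storage package is available in the function's environment.

```python
def count_words(text):
    """Placeholder processing step: count whitespace-separated words."""
    return len(text.split())


def handle_upload(event, context):
    """Entry point, called with metadata about the object that was uploaded."""
    name = event["name"]
    if not name.endswith(".txt"):
        return None  # ignore anything that is not a text file

    # Imported here so the helper above can be tested without the library.
    from google.cloud import storage

    blob = storage.Client().bucket(event["bucket"]).blob(name)
    result = count_words(blob.download_as_text())
    print(f"{name}: {result} words")
    return result
```

Inside Google Cloud the client picks up credentials automatically, so no key file is needed in the function itself.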
You will need to at least download the Python scripts into an environment first (be it GCE or GAE). To access the text files in GCS, you can use the https://pypi.org/project/google-cloud-storage/ library. I don't think you can execute Python scripts directly from the bucket.
If it is too troublesome to change the Python code so that it reads the text files from GCS, you will have to download everything into your environment (e.g. using gsutil).
