Python - download entire directory from Google Cloud Storage

The following page
https://googlecloudplatform.github.io/google-cloud-python/latest/storage/blobs.html
lists all the API calls that can be used for Python & Google Cloud Storage. Even the "official" samples on GitHub
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/storage/cloud-client/snippets.py
don't have a related example.
Finally, downloading a directory with the same method used for downloading files gives the error
Error: [Errno 21] Is a directory:

You just have to first list all the files in a directory and then download them one by one:
from google.cloud import storage

bucket_name = 'your-bucket-name'
prefix = 'your-bucket-directory/'
dl_dir = 'your-local-directory/'

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)  # Get list of files
for blob in blobs:
    filename = blob.name.replace('/', '_')
    blob.download_to_filename(dl_dir + filename)  # Download
Note that blob.name includes the entire directory structure plus the filename, so if you want the same file name as in the bucket, you might want to extract just the basename first (instead of replacing / with _), as in the sketch below.
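For example, a minimal sketch of that variant (using the same placeholder bucket and directory names as above), keeping only the basename of each object:
import os
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('your-bucket-name')
for blob in bucket.list_blobs(prefix='your-bucket-directory/'):
    if blob.name.endswith('/'):  # skip "directory" placeholder objects
        continue
    filename = os.path.basename(blob.name)  # keep only the file name itself
    blob.download_to_filename(os.path.join('your-local-directory/', filename))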

If you want to keep the same directory structure (without renaming) and also create the nested folders, here is a solution for Python 3.5+ based on @ksbg's answer:
from pathlib import Path
from google.cloud import storage

bucket_name = 'your-bucket-name'
prefix = 'your-bucket-directory/'
dl_dir = 'your-local-directory/'  # not used below; files are written under the bucket's own directory structure

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)  # Get list of files
for blob in blobs:
    if blob.name.endswith("/"):
        continue
    file_split = blob.name.split("/")
    directory = "/".join(file_split[0:-1])
    Path(directory).mkdir(parents=True, exist_ok=True)
    blob.download_to_filename(blob.name)

Let's say we want to download FINALFOLDER from the storage path gs://TEST_BUCKET_NAME/FOLDER1/FOLDER2/FINALFOLDER.
After downloading, the final path will look like: D:\\my_blob_data\FINALFOLDER
from os import makedirs
from os.path import join, isdir, isfile, basename
from google.cloud import storage

# if your environment was authenticated, the default config will be picked up
storage_client = storage.Client()  # comment this line if you want to use a service account instead
# uncomment the line below if you have a service account json
# storage_client = storage.Client.from_service_account_json('creds/sa.json')

bucket_name = 'TEST_BUCKET_NAME'
prefix = 'FOLDER1/FOLDER2/FINALFOLDER'  # path of the folder to download, relative to the bucket
dst_path = 'D:\\my_blob_data'

if not isdir(dst_path):
    makedirs(dst_path)

bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)  # Get list of files
for blob in blobs:
    blob_name = blob.name
    dst_file_name = blob_name.replace('FOLDER1/FOLDER2', dst_path)  # e.g. 'D:\\my_blob_data/FINALFOLDER/...'
    # extract the final directory and create it in the destination path if it does not exist
    dst_dir = dst_file_name.replace('/' + basename(dst_file_name), '')
    if not isdir(dst_dir):
        makedirs(dst_dir)
    # download the blob object
    blob.download_to_filename(dst_file_name)

Using the tensorflow gfile package, here is a recursive function.
root_dir is the GCS parent folder.
local_base_dir is the parent folder created locally.
import os
import tensorflow as tf

def copy_recursively(root_dir, local_base_dir):
    if tf.io.gfile.exists(local_base_dir):
        tf.io.gfile.rmtree(local_base_dir)
    tf.io.gfile.mkdir(local_base_dir)

    file_list = tf.io.gfile.glob(root_dir + '/**')
    for item in file_list:
        if not tf.io.gfile.isdir(item):
            fname = item.rsplit('/', 1)[-1]
            if not fname.startswith('.'):
                tf.io.gfile.copy(item,
                                 os.path.join(local_base_dir, fname),
                                 overwrite=False)
        else:
            child_dir = item.rsplit('/', 1)[-1]
            full_dir_path = os.path.join(local_base_dir, child_dir)
            print(f"Setting up child directory: {full_dir_path}")
            copy_recursively(item, full_dir_path)

root_dir = 'gs://.../.../..'
local_base_dir = root_dir.rsplit('/', 1)[-1]

copy_recursively(root_dir, local_base_dir)

Refer to this link: https://medium.com/@sandeepsinh/multiple-file-download-form-google-cloud-storage-using-python-and-gcs-api-1dbcab23c44
1 - Add Your Credential Json
2 - List Bucket Items
3 - Download
import logging
import os
from google.cloud import storage

global table_id
global bucket_name

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
bucket_name = 'mybucket'
table_id = 'shakespeare'
storage_client = storage.Client.from_service_account_json('/google-cloud/keyfile/service_account.json')
# The "folder" where the files you want to download are
folder = '/google-cloud/download/{}'.format(table_id)
delimiter = '/'
bucket = storage_client.get_bucket(bucket_name)
blobs = bucket.list_blobs(prefix=table_id, delimiter=delimiter)  # List all objects that satisfy the filter.

# Download the files to a destination
def download_to_local():
    logging.info('File download started... Wait for the job to complete.')
    # Create the folder locally if it does not exist
    if not os.path.exists(folder):
        os.makedirs(folder)
    # Iterate over the blobs one by one using the API call
    for blob in blobs:
        logging.info('Blobs: {}'.format(blob.name))
        destination_uri = '{}/{}'.format(folder, blob.name)
        blob.download_to_filename(destination_uri)
        logging.info('Exported {} to {}'.format(blob.name, destination_uri))

if __name__ == '__main__':
    download_to_local()

Related

copy files between s3 buckets using python

I have an Excel sheet with metadata containing 3 fields (path, folder_structure, filename):
Path: the path of the source file in the S3 source bucket
folder_structure: the new folder structure that needs to be created in the target bucket
filename: the filename the file needs to be renamed to after copying to the target bucket
I have the code below working with a Windows source folder: it creates the target folder and copies the data into it. I need to modify it to read from an S3 bucket and load into another S3 bucket.
code:
import pandas as pd
import os, shutil
from pathlib import Path

data = pd.read_excel('c:\data\sample_requirement.xlsx', engine='openpyxl')
root_dir = 'source'
for rec in range(len(data)):
    # Replacing the '|' symbol with a backslash
    dire = data['folder_structure'][rec].replace('|', '\\')
    # Appending the folder structure to the root directory
    directory = root_dir + '\\' + dire
    #print(directory)
    # Checking if path exists, if it exists -> skip, else -> create new
    if not os.path.exists(directory):
        #print('Not exist')
        # creating new directory
        os.makedirs(directory)
    # Path in the excel
    path = data['path'][rec]
    # Filename to change to
    filename = data['filename'][rec]
    #print(filename)
    if not os.path.isfile(directory + filename):
        # Copying the files to the created path
        shutil.copy(path, directory)
        # Renaming the files
        try:
            os.rename(directory + os.path.basename(path), directory + filename)
        except FileExistsError as e:
            print('File Name already Exists')
How about this: just add the snippet below to your code, replacing your source and destination buckets:
import boto3

s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'yoursourcebucket',
    'Key': 'yourkey'
}
s3.meta.client.copy(copy_source, 'nameofdestinationbucket', 'destinationkey')
It's good practice to follow the docs to understand the details of the code; also note that there may be many other ways to perform the same operation, for example using the AWS CLI: https://stackoverflow.com/a/32526487/13126651
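For illustration, here is a minimal sketch of how the Excel-driven loop from the question could be adapted to copy directly between buckets (the bucket names source-bucket and target-bucket are placeholders, and the '|' handling mirrors the original code):
import boto3
import pandas as pd

s3 = boto3.resource('s3')
data = pd.read_excel('sample_requirement.xlsx', engine='openpyxl')

for rec in range(len(data)):
    source_key = data['path'][rec]                             # key of the object in the source bucket
    folder = data['folder_structure'][rec].replace('|', '/')   # '|'-separated folders -> S3 prefix
    target_key = folder + '/' + data['filename'][rec]          # new prefix + new file name
    copy_source = {'Bucket': 'source-bucket', 'Key': source_key}
    s3.meta.client.copy(copy_source, 'target-bucket', target_key)  # server-side copy, no local download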
Copy one file from one bucket to another:
s3 = boto3.client("s3")
s3.copy({"Bucket": SOURCE_BUCKET, "Key": SOURCE_KEY}, DESTINATION_BUCKET, DESTINATION_KEY)
Copy up to 1000 files from one bucket to another:
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=SOURCE_BUCKET,
    Prefix=SOURCE_PREFIX,
)  # Warning: not handling pagination -> will truncate after 1000 keys
for file in response['Contents']:
    s3.copy(
        {"Bucket": SOURCE_BUCKET, "Key": file['Key']},
        DESTINATION_BUCKET,
        "/".join([DESTINATION_PREFIX, file['Key']]),
    )
Copy more than 1000 files:
properly handle pagination when calling list_objects_v2
loop while response['IsTruncated'] is true
pass response['NextContinuationToken'] as the ContinuationToken argument of the next call (see the sketch below)
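A minimal sketch of that pagination loop, reusing the same placeholder names (SOURCE_BUCKET, SOURCE_PREFIX, DESTINATION_BUCKET, DESTINATION_PREFIX) as above:
import boto3

s3 = boto3.client("s3")
list_kwargs = {"Bucket": SOURCE_BUCKET, "Prefix": SOURCE_PREFIX}
while True:
    response = s3.list_objects_v2(**list_kwargs)
    for obj in response.get("Contents", []):
        s3.copy(
            {"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            DESTINATION_BUCKET,
            "/".join([DESTINATION_PREFIX, obj["Key"]]),
        )
    if not response.get("IsTruncated"):  # no more pages left
        break
    list_kwargs["ContinuationToken"] = response["NextContinuationToken"]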

Upload folder in Blob Storage with SAS URI

I have to upload a set of folders into a dedicated container in Azure Blob Storage.
I found this:
https://github.com/rahulbagal/upload-file-azure-sas-url
but it only covers uploading a single file using a dedicated Blob SAS URI, and it works perfectly.
Is there any similar solution able to manage a folder upload instead of a single-file upload?
Thank you in advance.
1. Please try to use this code:
import os
from azure.storage.blob import BlobServiceClient

account_url = "https://<storage-account-name>.blob.core.windows.net/"
sas_token = "<your-sas-token>"
blob_service_client = BlobServiceClient(account_url, sas_token)
container_name = "<your-container-name>"
container_client = blob_service_client.get_container_client(container_name)

local_path = "<your-folder-path>"
folder_name = "<your-folder-name>"
for files in os.listdir(local_path):
    with open(os.path.join(local_path, files), "rb") as data:
        blob_client = blob_service_client.get_blob_client(container=container_name, blob=folder_name + "/" + files)
        blob_client.upload_blob(data)
2. Or you can use azcopy to upload your folder:
For example:
azcopy copy '<folder-path>' 'https://<account-name>.blob.core.windows.net/<container-name>?<sas-token>' --recursive
For more details, you can refer to the official azcopy documentation.

Unzip Password Protected Zip file automatically from azure storage?

I'm just wondering: is there a way to extract a password-protected zip file from Azure Storage?
I tried using a Python Azure Function, to no avail, as I had a problem reading the location of the file.
Would the file have to be stored in a shared location temporarily in order to achieve this?
Just looking for a bit of direction here; am I missing a step maybe?
Regards,
James
Azure Blob Storage provides storage functionality only; there is no runtime environment in it to perform an unzip operation. So basically, we should download the .zip file to the Azure Function, unzip it, and upload the files inside the .zip one by one.
For a quick test, I wrote an HTTP-trigger Azure Function demo that unzips a password-protected zip file; it works for me locally:
import azure.functions as func
import uuid
import os
import shutil
from azure.storage.blob import ContainerClient
from zipfile import ZipFile

storageAccountConnstr = '<storage account conn str>'
container = '<container name>'
# define local temp paths; on Azure, the path is recommended to be under /home
tempPathRoot = 'd:/temp/'
unZipTempPathRoot = 'd:/unZipTemp/'

def main(req: func.HttpRequest) -> func.HttpResponse:
    reqBody = req.get_json()
    fileName = reqBody['fileName']
    zipPass = reqBody['password']

    container_client = ContainerClient.from_connection_string(storageAccountConnstr, container)

    # download zip file
    zipFilePath = tempPathRoot + fileName
    with open(zipFilePath, "wb") as my_blob:
        download_stream = container_client.get_blob_client(fileName).download_blob()
        my_blob.write(download_stream.readall())

    # unzip to temp folder
    unZipTempPath = unZipTempPathRoot + str(uuid.uuid4())
    with ZipFile(zipFilePath) as zf:
        zf.extractall(path=unZipTempPath, pwd=bytes(zipPass, 'utf8'))

    # upload all files in temp folder
    for root, dirs, files in os.walk(unZipTempPath):
        for file in files:
            filePath = os.path.join(root, file)
            destBlobClient = container_client.get_blob_client(fileName + filePath.replace(unZipTempPath, ''))
            with open(filePath, "rb") as data:
                destBlobClient.upload_blob(data, overwrite=True)

    # remove all temp files
    shutil.rmtree(unZipTempPath)
    os.remove(zipFilePath)

    return func.HttpResponse("done")
Using a blob trigger would be better for doing this, as the HTTP approach will run into time-out errors if your zip file is huge.
Anyway, this is only a demo that shows you how to do this; a minimal blob-trigger sketch follows below.
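For reference, a minimal sketch of what the entry point of that blob-trigger variant could look like (the blob binding itself lives in function.json, which is not shown, and sourcing the zip password from an application setting is an assumption here):
import os
from io import BytesIO
from zipfile import ZipFile
import azure.functions as func

def main(myblob: func.InputStream):
    # assumption: the zip password is provided via an application setting
    zip_pass = os.environ.get("ZIP_PASSWORD", "")
    # read the triggering blob into memory and extract it to a temp folder;
    # uploading the extracted files would work exactly as in the HTTP demo above
    with ZipFile(BytesIO(myblob.read())) as zf:
        zf.extractall(path="/tmp/unzipped", pwd=bytes(zip_pass, "utf8"))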

move files between directories in single GCP bucket

I am trying to set up a Cloud Function to move files between folders inside one bucket in GCP.
Whenever the user loads files into the provided bucket folder, my Cloud Function should move the file to another folder that the big data scripts watch.
Deployment shows as successful; however, files are not being moved from the source folder.
Appreciate your help.
from google.cloud import storage

def move_file(bucket_name, bucket_Folder, blob_name):
    """Moves a blob from one folder to another with the same name."""
    bucket_name = 'bucketname'
    blob_name = 'filename'
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    source_blob = bucket.blob("Folder1/" + blob_name)
    new_blob = bucket.copy_blob(source_blob, bucket, "Folder2/" + blob_name)
    blob.delete()
    print('Blob {} in bucket {} copied to blob {}.'.format(source_blob.name, bucket.name, new_blob.name))
From the code you provided, the variable blob is not defined anywhere, so the source file won't be deleted. Instead of blob.delete(), change that line to source_blob.delete().
Also, I assume you are aware that you're "moving" just a single file. If you want to move all files prefixed with Folder1/ to Folder2 you could do something like this instead:
from google.cloud import storage

def move_files():
    storage_client = storage.Client()
    bucket = storage_client.get_bucket('bucketname')
    blobs = bucket.list_blobs(prefix='Folder1/')
    for blob in blobs:
        bucket.rename_blob(blob, new_name=blob.name.replace('Folder1/', 'Folder2/'))
For the latter, I reckon that there could be more efficient or better ways to do it.
If you are just moving the object inside the same bucket, you can simply rename the object with the desired path.
In Google Cloud Platform Storage there are no folders, just the illusion of them: everything after the name of the bucket is part of the name of the object.
Also, I can see several errors in your function. You can use this generic function to move a blob from one folder to another inside the same bucket:
from google.cloud import storage

def rename_blob(bucket_name, blob_name, new_name):
    """Renames a blob."""
    # bucket_name = "your-bucket-name"
    # blob_name = "folder/myobject"
    # new_name = "newfolder/myobject"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    new_blob = bucket.rename_blob(blob, new_name)
    print("Blob {} has been renamed to {}".format(blob.name, new_blob.name))

Remove absolute path while uploading file on S3

I am trying to upload a file to S3 in my bucket, using the following code, which works fine.
#!/usr/bin/python
import os
import boto
import boto.s3.connection
from boto.s3.key import Key
from boto.s3.connection import S3Connection
from datetime import datetime

try:
    conn = boto.s3.connect_to_region('us-east-1',
        aws_access_key_id = 'AKXXXXXXXXXXXXA',
        aws_secret_access_key = 'cXXXXXXXXXXXXXXXXXXXXXXXXX2',
        calling_format = boto.s3.connection.OrdinaryCallingFormat(),
        )
    print conn
    filename = '/home/rahul/GitDjangopostgres/AWSNewView/company/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04/cecb834f-ae85-49e3-b8a1-a6nC^U)3GcZ)M62d643aa7-d047-498c-bf59-8__Invoice (7).pdf'
    bucket = conn.get_bucket('bucketName', validate=False)
    key_name = filename
    print "file to upload", key_name
    secure_https_url = 'https://{host}/{bucket}{key}'.format(
        host=conn.server_name(),
        bucket='bucketName',
        key=key_name)
    print "secure_https_url", secure_https_url
    k = bucket.new_key(key_name)
    mp = k.set_contents_from_filename(key_name)
    print "File uploaded successfully"
except Exception, e:
    print str(e)
    print "error"
Now the problem is that since my file name is '/home/rahul/GitDjangopostgres/AWSNewView/company/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04/cecb834f-ae85-49e3-b8a1-a6nC^U)3GcZ)M62d643aa7-d047-498c-bf59-8__Invoice (7).pdf', it creates a hierarchical key in the bucket, so I get the file path https://s3.amazonaws.com/bucketName/home/rahul/GitDjangopostgres/AWSNewView/company/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04/cecb834f-ae85-49e3-b8a1-a6nC^U)3GcZ)M62d643aa7-d047-498c-bf59-8__Invoice (7).pdf. I want to change this hierarchy to https://s3.amazonaws.com/bucketName/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04/cecb834f-ae85-49e3-b8a1-a6nC^U)3GcZ)M62d643aa7-d047-498c-bf59-8__Invoice (7).pdf. Is there any option to do this with boto, or should I handle it in Python, given that uploading a file to S3 requires the absolute path of the file? I am using Django and this is my celery task.
The function set_contents_from_filename(key_name) takes whatever the key name is and stores the object in S3 under exactly that name. If your key name contains any /, it will create a hierarchical structure. For your situation I suggest that you work with two paths. One is your base local path, which contains the files you want to upload (use os.path.join to build it). The other is the AWS path, i.e. the hierarchy you want to create in your S3 bucket. As an example, you could declare the AWS path as:
/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04
Then you append the local filename to the key_name parameter that you pass to new_key.
Boto will work fine here.
example:
Create the S3 path that you want in your S3 storage, and also keep a base local path that contains all the files you want to upload; the filename is appended to both. The new_key function creates a key, i.e. the path under which the object will be stored. The set_contents_from_filename function takes a local file path and stores that file in S3 under the key (path) created above.
k = bucket.new_key(s3_path + filename)
mp = k.set_contents_from_filename(base_local_path + filename)
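Putting that together, a minimal sketch (it assumes the bucket object from the question's code above; the base_local_path and the example file name are hypothetical placeholders):
import os

# assumption: `bucket` comes from conn.get_bucket(...) as in the question above
base_local_path = '/home/rahul/GitDjangopostgres/AWSNewView/company/'  # local root to strip off the key
local_file = os.path.join(base_local_path, 'ExtraPaymentDocuments/some_folder/Invoice.pdf')  # hypothetical file

key_name = os.path.relpath(local_file, base_local_path)  # 'ExtraPaymentDocuments/some_folder/Invoice.pdf'
k = bucket.new_key(key_name)                             # the key no longer contains the absolute local path
k.set_contents_from_filename(local_file)                 # upload from the absolute local path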
