I have a requirement to copy files between two buckets, detailed below.
Bucket A/folder A is the source inbound box for daily files, which are created with names like f1_abc_20210304_000. I want to scan folder A (about 10 files arrive every day), pick up the latest files, and copy each one into Bucket B/Folder B/<file name> (i.e. for each of the 10 files)/2021/03/04, dropping the files into the 04 folder.
Any suggestions on how I should proceed with the design?
Thanks
RG
Did you want to do this copy task using Airflow?
If yes, Airflow provides the GCSToGCSOperator.
One approach is to use the client libraries; in the example below I'm using the Python client library for Google Cloud Storage.
move.py
from google.cloud import storage
from google.oauth2 import service_account
import os

# as mentioned on https://cloud.google.com/docs/authentication/production
key_path = "credentials.json"
credentials = service_account.Credentials.from_service_account_file(key_path)

storage_client = storage.Client(credentials=credentials)

bucket_name = "source-bucket-id"
destination_bucket_name = "destination-bucket-id"

source_bucket = storage_client.bucket(bucket_name)

# prefix 'original_data' is the folder where I store the data
array_blobs = source_bucket.list_blobs(prefix='original_data')

filtered_dict = []

for blob in array_blobs:
    if str(blob.name).endswith('.csv'):
        # add additional logic to handle the files you want to ingest
        filtered_dict.append({'name': blob.name, 'time': blob.time_created})

orderedlist = sorted(filtered_dict, key=lambda d: d['time'], reverse=True)
latestblob = orderedlist[0]['name']

# prefix 'destination_data' is the folder where I want to move the data
destination_blob_name = "destination_data/{}".format(os.path.basename(latestblob))

source_blob = source_bucket.blob(latestblob)
destination_bucket = storage_client.bucket(destination_bucket_name)
blob_copy = source_bucket.copy_blob(source_blob, destination_bucket, destination_blob_name)

print(
    "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
        source_blob.name,
        source_bucket.name,
        blob_copy.name,
        destination_bucket.name,
    )
)
For a bit of context on the code: I use the Google Cloud Storage Python client, authenticate, get the list of files from my source folder original_data inside bucket source-bucket-id, and collect the relevant files (you can modify the pick-up logic by adding your own criteria to fit your situation). After that I pick the latest file based on creation time and use its name to copy it into destination-bucket-id. As a note, the destination_blob_name variable includes both the folder where I want to place the file and the final filename.
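If the dated destination layout from the question (Folder B / ... / YYYY/MM/DD) needs to be derived from the source filename, here is a minimal sketch; the function name, the folder names, and the assumption that the date is the third underscore-separated token of a name like f1_abc_20210304_000 are all illustrative, so adjust them to your actual convention.

import os

def dated_destination(blob_name, destination_folder="folder_B"):
    """Builds a destination object name that drops the file into a YYYY/MM/DD folder."""
    filename = os.path.basename(blob_name)   # e.g. 'f1_abc_20210304_000'
    date_token = filename.split("_")[2]      # assumed to be 'YYYYMMDD', e.g. '20210304'
    year, month, day = date_token[:4], date_token[4:6], date_token[6:8]
    # adjust this layout if you also want a per-file-name folder level, as hinted in the question
    return "{}/{}/{}/{}/{}".format(destination_folder, year, month, day, filename)

# e.g. dated_destination("folder_A/f1_abc_20210304_000") -> 'folder_B/2021/03/04/f1_abc_20210304_000'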
UPDATE: I missed the airflow tag. In that case you should use the operator that comes with the Google provider, GCSToGCSOperator. The parameters to pass can be computed in a Python task and handed to the operator. It works like this:
@task(task_id="get_gcs_params")
def get_gcs_params(**kwargs):
    date = kwargs["next_ds"]
    # logic should be as displayed on move.py
    # ...
    return {"source_objects": source, "destination_object": destination}

gcs_params = get_gcs_params()

copy_file = GCSToGCSOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_objects=gcs_params.output['source_objects'],
    destination_bucket='data_backup',
    destination_object=gcs_params.output['destination_object'],
    gcp_conn_id=google_cloud_conn_id
)
For additional guidance you can check the Cloud Storage samples list; I used the "Copy an object between buckets" sample for guidance.
Related
I would like to write a script that detects new files (CSV files in this case) that have been added to a folder and then uploads those new files to my AWS S3 bucket, keeping their original names. Currently the script I have only lets me manually select a file and upload it with a name of my choice.
import boto3
import pandas as pd
from io import StringIO

hc = pd.read_csv(open(r'CSV PATH'))

s3 = boto3.client('s3', aws_access_key_id='ACCESSKEYID',
                  aws_secret_access_key='ACCESSKEY')

csv_buf = StringIO()
hc.to_csv(csv_buf, header=True, index=False)
csv_buf.seek(0)
s3.put_object(Bucket='BucketName', Body=csv_buf.getvalue(), Key='Original CSV Name from Above')
I assume I need the following sections in the code:
Code to monitor said location (but only when running the app - does not need to run 24/7)
Code to pull new file from said location
Code to upload to S3 Bucket
Any Tips?
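A rough sketch of those three steps, assuming boto3 with credentials resolved through the normal credential chain; the watch folder, the .csv filter, and the in-memory set of already-uploaded names are illustrative choices, not the only way to track new files.

import os
import boto3

def upload_new_files(watch_dir, bucket_name, already_uploaded):
    """Uploads any .csv in watch_dir that has not been uploaded yet, keeping the original name as the key."""
    s3 = boto3.client('s3')  # credentials resolved through the normal boto3 credential chain
    for name in os.listdir(watch_dir):
        if name.endswith('.csv') and name not in already_uploaded:
            s3.upload_file(os.path.join(watch_dir, name), bucket_name, name)
            already_uploaded.add(name)

# run it on demand; it only checks while the script is running, per the requirement
uploaded_so_far = set()
upload_new_files(r'CSV FOLDER PATH', 'BucketName', uploaded_so_far)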
How do I download all files in a folder from a GCS bucket using the Python client API?
Files like .docx and .pdf
The example below uses a downloaded service account credentials file to create the client; the comments in main.py explain that choice and the "Storage Object Viewer" permission the service account needs.
requirements.txt
google-cloud-storage
main.py
from google.cloud import storage
from os import makedirs

# use a downloaded credentials file to create the client, see
# https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication
# those docs tell you to export the file location, but I personally
# prefer the method used below as it allows for different credentials
# within the same application.
# IMHO separation of what each service account can access increases
# security tenfold. It's also useful when dealing with different
# projects in the same app.
#
# Note that you'll also have to give the service account the
# permission "Storage Object Viewer", or one with more permissions.
# Always use the least needed, due to security considerations
# https://cloud.google.com/storage/docs/access-control/iam-roles
cred_json_file_path = 'path/to/file/credentials.json'
client = storage.Client.from_service_account_json(cred_json_file_path)


def download_blob(bucket: storage.Bucket, remotefile: str, localpath: str = '.'):
    """downloads from remotepath to localpath"""
    localrelativepath = '/'.join(remotefile.split('/')[:-1])
    totalpath = f'{localpath}/{localrelativepath}'
    filename = f'{localpath}/{remotefile}'
    makedirs(totalpath, exist_ok=True)
    print(f'Current file details:\n remote file: {remotefile}\n local file: {filename}\n')
    blob = storage.Blob(remotefile, bucket)
    blob.download_to_filename(filename, client=client)


def download_blob_list(bucketname: str, bloblist: list, localpath: str = '.'):
    """downloads a list of blobs to localpath"""
    bucket = storage.Bucket(client, name=bucketname)
    for blob in bloblist:
        download_blob(bucket, blob, localpath)


def list_blobs(bucketname: str, remotepath: str = None, filetypes: list = []) -> list:
    """returns a list of blobs filtered by remotepath and filetypes
    remotepath and filetypes are optional"""
    result = []
    blobs = list(client.list_blobs(bucketname, prefix=remotepath))
    for blob in blobs:
        name = str(blob.name)
        # skip "folder" names
        if not name.endswith('/'):
            # do we need to filter file types?
            if len(filetypes) > 0:
                for filetype in filetypes:
                    if name.endswith(filetype):
                        result.append(name)
            else:
                result.append(name)
    return result


bucketname = 'bucketnamegoeshere'
foldername = 'foldernamegoeshere'
filetypes = ['.pdf', '.docx']  # list of extensions to return
bloblist = list_blobs(bucketname, remotepath=foldername, filetypes=filetypes)
# I'm just using the bucketname as localpath for the download location.
# It should work with any path.
download_blob_list(bucketname, bloblist, localpath=bucketname)
I just want to know why some of the invocations are working and some are not. I couldn't find any official documentation about a time delay or a restriction in the quotas provided by Google.
For example: I created a dataframe with the following code:
import pandas as pd
from google.cloud import storage

empty_df = pd.DataFrame(val)
empty_df.to_csv('/tmp/{}.csv'.format(SAMPLE))

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob('FOLDER1/{}.csv'.format(SAMPLE))
blob.upload_from_filename('/tmp/{}.csv'.format(SAMPLE))
The SAMPLE variable defined in the code changes on every loop iteration. I ran it in a for loop, and the Cloud Function was also triggered multiple times (anywhere from 1 to 50 invocations, or more). Up to this point everything looks fine. But after the function completes, I cannot see some of the CSV files in the FOLDER1 folder. I also have the same problem with the copy_blob function.
For example: I want to move CSV files from FOLDER1 to FOLDER2 with the new name created by the code above. Some of the CSV files do not appear in FOLDER2, and I also get 404 file-not-found errors in the logs. But when I check the buckets manually, I can see the files there.
def copy_blob(
    bucket_name, blob_name, destination_bucket_name, destination_blob_name,
    status_path, delete_blob=None
):
    """Copies a blob from one bucket to another with a new name."""
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)

    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )

    # delete the old blob if requested
    if delete_blob:
        source_blob.delete()

    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
I used that code to move the files. Does anyone have an idea?
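This is not a diagnosis of the missing files, but if the copy can run before the upload has finished, one defensive pattern is to wait until the source object is actually visible before copying. The function name, attempt count, and sleep interval below are arbitrary; it is a sketch only.

import time
from google.cloud import storage

def copy_when_visible(bucket_name, blob_name, destination_bucket_name,
                      destination_blob_name, attempts=5, wait_seconds=2):
    """Retries the copy a few times if the source object is not visible yet."""
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)
    for _ in range(attempts):
        source_blob = source_bucket.blob(blob_name)
        if source_blob.exists(storage_client):
            return source_bucket.copy_blob(
                source_blob, destination_bucket, destination_blob_name
            )
        time.sleep(wait_seconds)
    raise RuntimeError("{} never became visible in {}".format(blob_name, bucket_name))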
I want to copy data from one bucket to another bucket using a Google Cloud Function. At the moment I am able to copy only a single file to the destination, but I want to copy all files, folders, and sub-folders to my destination bucket.
from google.cloud import storage


def copy_blob(bucket_name="loggingforproject", blob_name="assestnbfile.json", destination_bucket_name="test-assest", destination_blob_name="logs"):
    """Copies a blob from one bucket to another with a new name."""
    bucket_name = "loggingforproject"
    blob_name = "assestnbfile.json"
    destination_bucket_name = "test-assest"
    destination_blob_name = "logs"

    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)

    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )

    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
Using gsutil cp is a good option. However, if you want to copy the files using Cloud Functions - it can be achieved as well.
At the moment, your function only copies a single file. In order to copy the whole content of your bucket you would need to iterate through the files within it.
Here is a code sample that I wrote for an HTTP Cloud Function and tested - you can use it for a reference:
MAIN.PY
from google.cloud import storage


def copy_bucket_files(request):
    """
    Copies the files from a specified bucket into the selected one.
    """
    # Check if the bucket's name was specified in the request
    if request.args.get('bucket'):
        bucketName = request.args.get('bucket')
    else:
        return "The bucket name was not provided. Please try again."

    try:
        # Initiate the Cloud Storage client
        storage_client = storage.Client()

        # Define the origin bucket
        origin = storage_client.bucket(bucketName)

        # Define the destination bucket
        destination = storage_client.bucket('<my-test-bucket>')

        # Get the list of the blobs located inside the bucket whose files you want to copy
        blobs = storage_client.list_blobs(bucketName)

        for blob in blobs:
            origin.copy_blob(blob, destination)

        return "Done!"
    except:
        return "Failed!"
REQUIREMENTS.TXT
google-cloud-storage==1.22.0
How to call that function:
It can be called via the URL provided for triggering the function, by appending that URL with /?bucket=<name-of-the-bucket-to-copy> (name without <, >):
https://<function-region>-<project-name>.cloudfunctions.net/<function-name>/?bucket=<bucket-name>
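For example, from Python with the requests library (the URL pieces below are placeholders, and this assumes the function allows unauthenticated invocations; otherwise you would also need to send an identity token):

import requests

# placeholders: fill in your function's region, project and name, plus the bucket to copy
url = 'https://<function-region>-<project-name>.cloudfunctions.net/<function-name>'
response = requests.get(url, params={'bucket': '<bucket-name>'})
print(response.status_code, response.text)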
You can use the gsutil cp command for this:
gsutil cp gs://first-bucket/* gs://second-bucket
See https://cloud.google.com/storage/docs/gsutil/commands/cp for more details
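If the source bucket contains nested "folders", you may also need the recursive flag, for example gsutil cp -r gs://first-bucket/* gs://second-bucket.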
Here is my TypeScript code; I call it from my website when I need to move images.
exports.copiarImagen = functions.https.onCall(async (data, response) => {
    var origen = data.Origen;
    var destino = data.Destino;

    console.log('Files:');
    const [files] = await admin.storage().bucket("bucket's path").getFiles({ prefix: 'path where your images are' });

    files.forEach(async file => {
        var nuevaRuta = file.name;
        await admin.storage().bucket("posavka.appspot.com").file(file.name)
            .copy(admin.storage().bucket("posavka.appspot.com").file(nuevaRuta.replace(origen, destino)));
        await admin.storage().bucket("posavka.appspot.com").file(file.name).delete();
    });
});
First I get all the files in a specific path, then I copy those files to the new path, and finally I delete the files from the old path.
I hope it helps you :D
I have an S3 server with millions of files under each bucket. I want to download files from a bucket, but only the files that meet a particular condition.
Is there a better way than listing the whole bucket and checking the condition while iterating over the files?
As can be seen here:
import os

# Import the SDK
import boto
from boto.s3.connection import OrdinaryCallingFormat

LOCAL_PATH = 'W:/RD/Fancy/s3_opportunities/'

bucket_name = '/recording'  # /sampledResponseLogger'

# connect to the bucket
print 'Connecting...'
conn = boto.connect_s3(calling_format=OrdinaryCallingFormat())  # conn = boto.connect_s3()

print 'Getting bucket...'
bucket = conn.get_bucket(bucket_name)

print 'Going through the list of files...'
bucket_list = bucket.list()
for l in bucket_list:
    keyString = str(l.key)
    # SOME CONDITION
    if '2015-08' in keyString:
        # check if the file exists locally, if not: download it
        filename = LOCAL_PATH + keyString[56:]
        if not os.path.exists(filename):
            print 'Downloading file: ' + keyString + '...'
            # Download the object that the key represents
            l.get_contents_to_filename(filename)
The only mechanism available for filtering ListBucket operations on the server side is the prefix. So, if your objects in S3 have some sort of an implied directory structure (e.g. foo/bar/fie/baz/object1) then you can use the prefix to list only the objects that start with, for example, foo/bar/fie. If your object names do not display this hierarchical naming, there really isn't anything you can do except list all of the objects and filter using your own mechanism.
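For comparison, the same prefix idea with the newer boto3 client looks roughly like this; the bucket name, prefix, and the '2015-08' condition are just placeholders mirroring the code above.

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# the Prefix is applied server side; any further condition still runs client side
for page in paginator.paginate(Bucket='recording', Prefix='foo/bar/fie/'):
    for obj in page.get('Contents', []):
        if '2015-08' in obj['Key']:
            print(obj['Key'])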