Copy a large number of files in S3 within the same bucket - python

I got a "directory" on a s3 bucket with 80 TB ~ and I need do copy everything to another directory in the same bucket
source = s3://mybucket/abc/process/
destiny = s3://mybucket/cde/process/
I already tried to use aws s3 sync, but worked only for the big files, still left 50 TB to copy. I'm thinking about to use a boto3 code as this example below, but I don't know how to do for multiple files/directories recursively.
import boto3

s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.meta.client.copy(copy_source, 'otherbucket', 'otherkey')
How can I do this using boto3?

While there may be better ways of doing this using bucket policies, it can be done using boto3.
First, you will need to get a list of the contents of the bucket:
bucket_items = s3_client.list_objects_v2(Bucket=source_bucket, Prefix=source_prefix)
bucket_contents = bucket_items.get('Contents', [])
where source_bucket is the name of the bucket and source_prefix is the name of the folder.
Next, iterate over the contents and for each item call the s3.meta.client.copy method, like so:
for content in bucket_contents:
    copy_source = {
        'Bucket': source_bucket,
        'Key': content['Key']
    }
    s3.meta.client.copy(copy_source, source_bucket, destination_prefix + '/' + content['Key'].split('/')[-1])
Each item in the contents list is a dictionary, so you must use 'Key' to get the name of the object and use split to break it into prefix and file name.
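Note that list_objects_v2 returns at most 1000 keys per call, and the split('/')[-1] approach above flattens any sub-folders. A minimal sketch using a paginator that also preserves the key structure, with the bucket and prefix names from the question as placeholders:
import boto3

s3_client = boto3.client('s3')

source_bucket = 'mybucket'           # placeholder bucket name from the question
source_prefix = 'abc/process/'       # source "directory"
destination_prefix = 'cde/process/'  # destination "directory"

# list_objects_v2 returns at most 1000 keys per call, so paginate
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=source_bucket, Prefix=source_prefix):
    for content in page.get('Contents', []):
        copy_source = {'Bucket': source_bucket, 'Key': content['Key']}
        # keep the key structure below the prefix instead of flattening it
        new_key = destination_prefix + content['Key'][len(source_prefix):]
        s3_client.copy(copy_source, source_bucket, new_key)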

Related

AWS Lambda to delete everything from a specific folder in an S3 bucket

I'm trying to delete everything from a specific folder in an S3 bucket with AWS Lambda using Python. The Lambda runs successfully; however, the files still exist in "folder1". There will be no sub-folders under this folder, only files.
Could someone please assist? Here is the code:
import json
import os
import boto3

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    deletefile_bucket = s3.Bucket('test_bucket')
    response = deletefile_bucket.delete_objects(
        Delete={
            'Objects': [
                {
                    'Key': 'folder1/'
                },
            ],
        }
    )
The delete_objects() command requires a list of object keys to delete. It does not perform wildcard operations and it does not delete the contents of subdirectories.
You will need to obtain a listing of all objects and then specifically request those objects to be deleted.
The delete_objects() command accepts up to 1000 objects to delete.
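A minimal sketch of that approach, assuming the bucket name and prefix from the question:
import boto3

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test_bucket')  # bucket name from the question

    # list every key under the prefix, then delete in batches of up to 1000
    keys = [{'Key': obj.key} for obj in bucket.objects.filter(Prefix='folder1/')]
    for i in range(0, len(keys), 1000):
        bucket.delete_objects(Delete={'Objects': keys[i:i + 1000]})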

Delete files under S3 bucket recursively without deleting folders using python

I'm getting an error when I try to delete all files under a specific folder.
The problem is here: 'Key': 'testpart1/*.*'
I would also like to delete files older than 30 days; please help me with the script.
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my-bucket')
response = my_bucket.delete_objects(
    Delete={
        'Objects': [
            {
                'Key': 'testpart1/*.*' # the_name of_your_file
            }
        ]
    }
)
The code below will delete all files under the prefix recursively:
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my-bucket')
response = my_bucket.objects.filter(Prefix="testpart1/").delete()
Please check https://stackoverflow.com/a/59146547/4214976 to filter the objects based on date.
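For the 30-day part of the question, a minimal sketch, assuming the same bucket and prefix, that compares each object's last_modified timestamp against a cutoff:
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my-bucket')
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for obj in my_bucket.objects.filter(Prefix="testpart1/"):
    # last_modified is timezone-aware, so compare against an aware datetime
    if obj.last_modified < cutoff:
        obj.delete()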

AWS Lambda - copy object to another S3 location

I'd like to write Lambda Python code to move files within the same S3 bucket.
[same S3 bucket]
/location-as-is/file.jpg
[same S3 bucket]
/location-to-be/file.jpg
How can I do that?
Thank you.
In order to get this to work you will need a few things. First is the Lambda code itself. You should be able to use the Python SDK boto3 to make the call to copy. Here is an example of how to copy your file:
import json
import boto3

s3 = boto3.resource('s3')

def lambda_handler(event, context):
    my_bucket = "example-bucket"
    current_object_key = "fileA/keyA.jpg"
    new_object_key = "fileB/keyB.jpg"

    copy_source = {
        'Bucket': my_bucket,
        'Key': current_object_key
    }

    s3.meta.client.copy(copy_source, my_bucket, new_object_key)
You will also need to make sure your Lambda execution role has the proper S3 read and write permissions and that your S3 bucket policy is configured to allow your Lambda role to access it.
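Since the question asks to move the file rather than just copy it, a minimal follow-up sketch (reusing the names from the example above) that deletes the original once the copy has succeeded:
# inside lambda_handler, after the copy call above
s3.Object(my_bucket, current_object_key).delete()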
You can use boto for this purpose, as below:
import boto
c = boto.connect_s3()
src_buc = c.get_bucket('Source_Bucket')
sink_buc = c.get_bucket('Sink_Bucket')
and then you can iterate over all your keys to copy the content:
for k in src_buc.list():
    # copy each key to the sink bucket under the same name
    sink_buc.copy_key(k.name, src_buc.name, k.name)

Google cloud function to copy all data of source bucket to another bucket using python

I want to copy data from one bucket to another bucket using a Google Cloud Function. At the moment I am able to copy only a single file to the destination, but I want to copy all files, folders, and sub-folders to my destination bucket.
from google.cloud import storage

def copy_blob(bucket_name="loggingforproject", blob_name="assestnbfile.json", destination_bucket_name="test-assest", destination_blob_name="logs"):
    """Copies a blob from one bucket to another with a new name."""
    bucket_name = "loggingforproject"
    blob_name = "assestnbfile.json"
    destination_bucket_name = "test-assest"
    destination_blob_name = "logs"

    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)

    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )

    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
Using gsutil cp is a good option. However, if you want to copy the files using Cloud Functions, it can be achieved as well.
At the moment, your function only copies a single file. In order to copy the whole content of your bucket, you would need to iterate through the files within it.
Here is a code sample that I wrote for an HTTP Cloud Function and tested; you can use it as a reference:
MAIN.PY
from google.cloud import storage

def copy_bucket_files(request):
    """
    Copies the files from a specified bucket into the selected one.
    """
    # Check if the bucket's name was specified in the request
    if request.args.get('bucket'):
        bucketName = request.args.get('bucket')
    else:
        return "The bucket name was not provided. Please try again."

    try:
        # Initiate Cloud Storage client
        storage_client = storage.Client()

        # Define the origin bucket
        origin = storage_client.bucket(bucketName)

        # Define the destination bucket
        destination = storage_client.bucket('<my-test-bucket>')

        # Get the list of the blobs located inside the bucket whose files you want to copy
        blobs = storage_client.list_blobs(bucketName)

        for blob in blobs:
            origin.copy_blob(blob, destination)

        return "Done!"
    except:
        return "Failed!"
REQUIREMENTS.TXT
google-cloud-storage==1.22.0
How to call that function:
It can be called via the URL provided for triggering the function, by appending that URL with /?bucket=<name-of-the-bucket-to-copy> (name without <, >):
https://<function-region>-<project-name>.cloudfunctions.net/<function-name>/?bucket=<bucket-name>
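For example, a minimal sketch of triggering it from Python; the URL and bucket name below are placeholders:
import requests

# placeholder trigger URL; substitute your function's region, project, and name
url = "https://us-central1-my-project.cloudfunctions.net/copy_bucket_files"
response = requests.get(url, params={"bucket": "my-source-bucket"})
print(response.text)  # "Done!" on success, "Failed!" otherwise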
You can use the gsutil cp command for this:
gsutil cp gs://first-bucket/* gs://second-bucket
See https://cloud.google.com/storage/docs/gsutil/commands/cp for more details
Here is my TypeScript code; I call it from my website when I need to move images.
exports.copiarImagen = functions.https.onCall(async (data, response) => {
    var origen = data.Origen;
    var destino = data.Destino;

    console.log('Files:');
    const [files] = await admin.storage().bucket("bucket's path").getFiles({ prefix: 'path where your images are' });
    files.forEach(async file => {
        var nuevaRuta = file.name;
        await admin.storage().bucket("posavka.appspot.com").file(file.name)
            .copy(admin.storage().bucket("posavka.appspot.com").file(nuevaRuta.replace(origen, destino)));
        await admin.storage().bucket("posavka.appspot.com").file(file.name).delete();
    });
});
First I get all the files in a specific path, then I copy those files to the new path, and finally I delete the files at the old path.
I hope it helps you :D

Boto3, s3 folder not getting deleted

I have a directory 'test' in my S3 bucket, and I want to delete this directory.
This is what I'm doing:
s3 = boto3.resource('s3')
s3.Object(S3Bucket,'test').delete()
and I'm getting a response like this:
{'ResponseMetadata': {'HTTPStatusCode': 204, 'HostId':
'************', 'RequestId': '**********'}}
but my directory is not getting deleted!
I tried all combinations of '/test', 'test/' and '/test/' etc., also with a file inside that directory and with an empty directory, and all of them failed to delete 'test'.
delete_objects enables you to delete multiple objects from a bucket using a single HTTP request. You may specify up to 1000 keys.
https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Bucket.delete_objects
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')

objects_to_delete = []
for obj in bucket.objects.filter(Prefix='test/'):
    objects_to_delete.append({'Key': obj.key})

bucket.delete_objects(
    Delete={
        'Objects': objects_to_delete
    }
)
NOTE: See Daniel Levinson's answer for a more efficient way of deleting multiple objects.
In S3, there are no directories, only keys. If a key name contains a /, such as prefix/my-key.txt, then the AWS console groups all the keys that share this prefix together for convenience.
To delete a "directory", you would have to find all the keys whose names start with the directory name and delete each one individually. Fortunately, boto3 provides a filter function to return only the keys that start with a certain string, so you can do something like this:
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')

for obj in bucket.objects.filter(Prefix='test/'):
    s3.Object(bucket.name, obj.key).delete()
