boto3 for AWS - S3 bucket sync optimization in Python

Currently I am trying to compare two S3 buckets, with the goal of deleting files.
Problem definition:
-BucketA
-BucketB
The script looks for files (same key name) in BucketB which are not available in BucketA.
Those files which are only available in BucketB have to be deleted.
The buckets contain about 3-4 million files each.
Many thanks.
Kind regards,
Alexander
My solution idea:
Filling the lists is quite slow. Is there any possibility to accelerate it?
# Filling the lists
# e.g. BucketA (BucketB works the same way)
import boto3

s3_client = boto3.client("s3")
bucket_name = "BucketA"

ListA = []
paginator = s3_client.get_paginator("list_objects_v2")
response = paginator.paginate(Bucket=bucket_name)  # default page size is 1000
for page in response:
    for file in page.get("Contents", []):
        ListA.append(file["Key"])

# finding the deletion values (keys present in BucketB but not in BucketA)
diff = list(set(ListB) - set(ListA))

# delete files from BucketB (building chunks, since delete_objects accepts max. 1000 objects at once)
for delete_list in group_elements(diff, 1000):
    delete_objects_from_bucket(delete_list)

The ListBucket() API call only returns 1000 objects at a time, so listing buckets with 100,000+ objects is very slow and best avoided. You have 3-4 million objects, so definitely avoid listing them!
Instead, use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects in a bucket. Activate it for both buckets, and then use the provided CSV files as input to make the comparison.
You can also use the CSV file generated by Amazon S3 Inventory as a manifest file for Amazon S3 Batch Operations. So, your code could generate a file that lists only the objects that you would like deleted, then get S3 Batch Operations to process those deletions.
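As a rough illustration, here is a minimal sketch of the comparison step, assuming both inventories have already been delivered, downloaded, and unzipped locally as bucket_a_inventory.csv and bucket_b_inventory.csv (hypothetical filenames), using the standard CSV inventory layout where the key is the second column:

import csv

def load_keys(inventory_csv):
    # Standard S3 Inventory CSV rows start with: "bucket","key",...
    # Keys are URL-encoded, but that is consistent across both files,
    # and S3 Batch Operations expects URL-encoded keys in its manifest anyway.
    with open(inventory_csv, newline="") as f:
        return {row[1] for row in csv.reader(f)}

keys_a = load_keys("bucket_a_inventory.csv")   # hypothetical local copy of BucketA's inventory
keys_b = load_keys("bucket_b_inventory.csv")   # hypothetical local copy of BucketB's inventory

# Keys that exist only in BucketB and should therefore be deleted
to_delete = keys_b - keys_a

# Write a "bucket,key" manifest that S3 Batch Operations can consume for the delete job
with open("delete_manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for key in sorted(to_delete):
        writer.writerow(["BucketB", key])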

Related

How to get & iterate through the data from an S3 bucket with a date range in Python

I have an S3 bucket which has the data in files named by date.
I was able to fetch the data for a single day by giving the date as the Prefix value, since the data will be in that file.
input: 2022/10/15
import boto3

s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)
response = s3.list_objects(Bucket=bucket, Prefix=date_prefix)  # e.g. date_prefix = '2022/10/15'
print(response)
But I want to fetch the data for a date range. How can I change this code to work for that scenario?
For example, if I give the input date as 2022/10/01, I want to fetch the data from 2022/10/01 to today.
How can I iterate over the dates and fetch the data for all files under 2022/10/01, 2022/10/02 ... up to today?
The list_objects_v2() method supports a StartAfter parameter:
StartAfter (string) -- StartAfter is where you want Amazon S3 to start listing from. Amazon S3 starts listing after this specified key. StartAfter can be any key in the bucket.
So, you could use StartAfter to commence your listing at the name of the first directory, and then receive a list of all objects after that key. Since the folders are named with dates, they will already be sorted in correct order. Just keep reading the file list until the Key no longer matches the folder-naming standard.
Rather than listing the contents of each folder, you are listing the contents of the bucket. But, that's the same result since Amazon S3 does not actually use folders.
Please note that list_objects_v2() only returns 1000 objects per call, so it might be necessary to loop through the result set using ContinuationToken.
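A minimal sketch of that approach, assuming the keys are named with a YYYY/MM/DD/ prefix and using a hypothetical bucket name my-bucket:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

start_date = '2022/10/01'   # list everything from this date-named prefix onwards
keys = []

# StartAfter begins the listing just after this key; the paginator then
# follows ContinuationToken automatically, 1000 objects per page.
for page in paginator.paginate(Bucket='my-bucket', StartAfter=start_date):
    for obj in page.get('Contents', []):
        keys.append(obj['Key'])

print(f"{len(keys)} objects found from {start_date} onwards")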

Why can I not get the whole list of files in an S3 bucket using Python?

I am trying to get the contents of files in S3, and for that I am first getting the list of the files from different folders/subfolders whose contents I will fetch. However, I have realized that my method does not give me all the files in that bucket; it only reads fewer than half of the files in the folders/subfolders and I am not sure what I am doing wrong. Here is my code:
import boto3

def get_s3_list(bucket, prefix):
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
I think the part where I call s3.list_objects_v2 needs to be modified, but I am not familiar with it. Thanks in advance.
You have to extend your code and add pagination. list_objects_v2() returns at most 1000 objects per call, so only with pagination can you get the full list of objects in your bucket.
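For example, a minimal paginated version of the function from the question could look like this:

import boto3

def get_s3_list(bucket, prefix):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    # The paginator follows ContinuationToken behind the scenes,
    # so every page of up to 1000 objects is returned.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys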

How to download latest n items from AWS S3 bucket using boto3?

I have an S3 bucket where my application saves some final result DataFrames as .csv files. I would like to download the latest 1000 files in this bucket, but I don't know how to do it.
I cannot do it manually, as the bucket doesn't allow me to sort the files by date because it has more than 1000 elements.
I've seen some questions that could work using AWS CLI, but I don't have enough user permissions to use the AWS CLI, so I have to do it with a boto3 python script that I'm going to upload into a lambda.
How can I do this?
If your application uploads files periodically, you could try this:
import boto3
import datetime

last_n_days = 250

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='processed')

# LastModified is timezone-aware (UTC), so compare against an aware datetime
date_limit = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=last_n_days)

for page in pages:
    for obj in page.get('Contents', []):
        if obj['LastModified'] >= date_limit and obj['Key'][-1] != '/':
            s3.download_file('bucket', obj['Key'], obj['Key'].split('/')[-1])
With the script above, all files modified in the last 250 days will be downloaded. If your application uploads 4 files per day, this should do the trick.
The best solution is to redefine your problem: rather than retrieving the N most recent files, retrieve all files from the N most recent days. I think that you'll find this to be a better solution in most cases.
However, to make it work you'll need to adopt some form of date-stamped prefix for the uploaded files. For example, 2021-04-16/myfile.csv.
If you feel that you must retrieve N files, then you can use the prefix to retrieve only a portion of the list. Assuming that you know that you have approximately 100 files uploaded per day, then start your bucket listing with 2021-04-05/.
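A minimal sketch of that idea, assuming a hypothetical bucket my-bucket whose keys use the date-stamped prefix format suggested above (YYYY-MM-DD/):

import boto3
import datetime

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

n_days = 10   # retrieve everything uploaded in the N most recent days
today = datetime.date.today()

objects = []
for i in range(n_days):
    prefix = (today - datetime.timedelta(days=i)).isoformat() + '/'   # e.g. '2021-04-16/'
    for page in paginator.paginate(Bucket='my-bucket', Prefix=prefix):
        objects.extend(page.get('Contents', []))

# Newest first, if a strict ordering is still required
objects.sort(key=lambda o: o['LastModified'], reverse=True)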

Download latest uploaded file from amazon s3 using boto3 in python

I have a few CSV files inside one of my buckets on Amazon S3.
I need to download the latest uploaded CSV file.
How do I achieve this using boto3 in Python?
Thanks.
S3 doesn't have an API for listing files ordered by date
However, if you indeed have only a few, you can list the files in the bucket and order them by last modification time.
bucketList = s3Client.list_objects(Bucket=<MyBucket>)  # notice this returns up to 1000 files
orderedList = sorted(bucketList.get('Contents', []), key=lambda k: k['LastModified'])
lastUpdatedKey = orderedList[-1]['Key']
object = s3Client.get_object(Bucket=<MyBucket>, Key=lastUpdatedKey)

How should I create an index of files using boto and S3 with Python/Django?

I have a folder structure with 2000 files on S3.
I want to run a program every week that gets the list of files and folders from S3 and populates the database.
Then I use that database to show the same folder structure on the site.
I have two problems:
How can I get the list of folders from there and then store them in MySQL? Do I need to grab all the file names and then split on "/"? But it looks difficult to see which files belong to which folders. I have found this https://stackoverflow.com/a/17096755/1958218 but could not find where the listObjects() function is.
Doesn't the get_all_keys() method of the S3 bucket do what you need?
s3 = boto.connect_s3()
b = s3.get_bucket('bucketname')
keys = b.get_all_keys()
Then iterate over the keys, apply os.path.split, and take the unique folder paths...
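A minimal sketch of that last step, assuming keys is the result of get_all_keys() above:

import os

folders = set()
files_by_folder = {}

for key in keys:
    # boto2 Key objects expose the key string as .name, e.g. 'folder/subfolder/file.ext';
    # os.path.split separates the directory part from the file name.
    folder, filename = os.path.split(key.name)
    folders.add(folder)
    files_by_folder.setdefault(folder, []).append(filename)

# 'folders' and 'files_by_folder' can now be stored in MySQL
# to rebuild the same folder structure on the site.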
