Download latest uploaded file from Amazon S3 using boto3 in Python

I have a few CSV files inside one of my buckets on Amazon S3.
I need to download the latest uploaded CSV file.
How can I achieve this using boto3 in Python?
Thanks.

S3 doesn't have an API for listing objects ordered by date.
However, if you indeed have only a few, you can list the objects in the bucket and sort them by last modification time.
import boto3

s3_client = boto3.client('s3')
bucket_list = s3_client.list_objects(Bucket='my-bucket')  # notice this returns up to 1000 objects
ordered_list = sorted(bucket_list['Contents'], key=lambda obj: obj['LastModified'])
last_updated_key = ordered_list[-1]['Key']
obj = s3_client.get_object(Bucket='my-bucket', Key=last_updated_key)
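If the bucket can hold more than 1000 objects, the same idea works with a paginator; here is a minimal sketch, assuming a placeholder bucket name:

import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

# collect every object across all pages, then pick the newest one
all_objects = []
for page in paginator.paginate(Bucket='my-bucket'):
    all_objects.extend(page.get('Contents', []))

latest = max(all_objects, key=lambda obj: obj['LastModified'])
obj = s3_client.get_object(Bucket='my-bucket', Key=latest['Key'])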

Related

Find all JSON files within S3 Bucket

Is it possible to find all .json files within an S3 bucket where the bucket itself can have multiple sub-directories?
Actually my bucket includes multiple sub-directories, and I would like to collect all JSON files inside them in order to iterate over them and parse specific key/values.
Here's the solution (uses the boto3 module):
import boto3

s3 = boto3.client('s3')  # Create the connection to your bucket
objs = s3.list_objects_v2(Bucket='my-bucket')['Contents']
files = [obj for obj in objs if obj['Key'].endswith('.json')]  # keep .json keys only
The syntax for the list_objects_v2 function in boto3 can be found here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects_v2
Note that only the first 1000 keys are returned. To retrieve more than 1000 keys and exhaust the full listing, you can use the Paginator class.
s3 = boto3.client('s3')  # Create the connection to your bucket
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='my-bucket')

files = []
for page in pages:
    # keep only the .json keys from each page
    page_files = [obj for obj in page.get('Contents', []) if obj['Key'].endswith('.json')]
    files.extend(page_files)
Note: I recommend using a function that uses yield to slowly iterate over the files instead of returning the whole list, especially if the number of json files in your bucket is extremely large.
Alternatively, you can also use the ContinuationToken parameter (check the boto3 reference linked above).
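As a rough illustration of the yield-based approach mentioned above (the bucket name is a placeholder):

import boto3

def iter_json_keys(bucket_name):
    # yield .json keys one page at a time instead of building the whole list in memory
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name):
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('.json'):
                yield obj['Key']

for key in iter_json_keys('my-bucket'):
    print(key)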

python boto3 for AWS - S3 Bucket Sync optimization

Currently I am trying to compare two S3 buckets with the goal of deleting files.
Problem definition:
-BucketA
-BucketB
The script looks for files (same key name) in BucketB which are not available in BucketA.
Files that are only available in BucketB have to be deleted.
The buckets contain about 3-4 million files each.
Many thanks.
Kind regards,
Alexander
My solution idea:
Filling the lists is quite slow. Is there any way to accelerate it?
# Filling the lists
# e.g. BucketA (BucketB same procedure)
s3_client = boto3.client("s3")
bucket_name = "BucketA"
paginator = s3_client.get_paginator("list_objects_v2")
response = paginator.paginate(Bucket=bucket_name, PaginationConfig={"PageSize": 2})
ListA = []
for page in response:
    files = page.get("Contents", [])
    for file in files:
        ListA.append(file['Key'])

# finding the deletion values
diff = list(set(ListB) - set(ListA))

# delete files from BucketB (building chunks, since delete_objects accepts max. 1000 objects at once)
for delete_list in group_elements(diff, 1000):
    delete_objects_from_bucket(delete_list)
The ListBucket() API call only returns 1000 objects at a time, so listing buckets with 100,000+ objects is very slow and best avoided. You have 3-4 million objects, so definitely avoid listing them!
Instead, use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects in a bucket. Activate it for both buckets, and then use the provided CSV files as input to make the comparison.
You can also use the CSV file generated by Amazon S3 Inventory as a manifest file for Amazon S3 Batch Operations. So, your code could generate a file that lists only the objects that you would like deleted, then get S3 Batch Operations to process those deletions.
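As a rough sketch of the comparison step, assuming both inventory reports have already been downloaded as local CSV files (the file names are placeholders, and the column order depends on the fields you configured; the standard CSV layout starts with bucket name, then key):

import csv

def keys_from_inventory(csv_path, key_column=1):
    # S3 Inventory CSV reports have no header row; key names are URL-encoded
    with open(csv_path, newline='') as f:
        return {row[key_column] for row in csv.reader(f)}

keys_a = keys_from_inventory('bucket_a_inventory.csv')
keys_b = keys_from_inventory('bucket_b_inventory.csv')

# keys present only in BucketB are the candidates for deletion
to_delete = sorted(keys_b - keys_a)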

get just sub folder name in bucket s3 [duplicate]

I have boto code that collects S3 sub-folders in the levelOne folder:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket("MyBucket")
for level2 in bucket.list(prefix="levelOne/", delimiter="/"):
    print(level2.name)
Please help me find similar functionality in boto3. The code should not iterate through all S3 objects because the bucket has a very large number of objects.
If you are simply seeking a list of folders, then use CommonPrefixes returned when listing objects. Note that a Delimiter must be specified to obtain the CommonPrefixes:
import boto3
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='BUCKET-NAME', Delimiter='/')
for prefix in response['CommonPrefixes']:
    print(prefix['Prefix'][:-1])
If your bucket has a HUGE number of folders and objects, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
I think the following should be equivalent:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('MyBucket')
for obj in bucket.objects.filter(Prefix="levelOne/", Delimiter="/"):
    print(obj.key)

How to download latest n items from AWS S3 bucket using boto3?

I have an S3 bucket where my application saves some final result DataFrames as .csv files. I would like to download the latest 1000 files in this bucket, but I don't know how to do it.
I cannot do it manually, as the bucket doesn't allow me to sort the files by date because it has more than 1000 elements.
I've seen some questions that could work using the AWS CLI, but I don't have enough user permissions to use the AWS CLI, so I have to do it with a boto3 Python script that I'm going to upload into a Lambda.
How can I do this?
If your application uploads files periodically, you could try this:
import boto3
import datetime

last_n_days = 250

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='processed')

# LastModified is timezone-aware (UTC), so compare against an aware datetime
date_limit = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=last_n_days)

for page in pages:
    for obj in page.get('Contents', []):
        # skip "folder" placeholder keys and anything older than the cut-off
        if obj['LastModified'] >= date_limit and obj['Key'][-1] != '/':
            s3.download_file('bucket', obj['Key'], obj['Key'].split('/')[-1])
With the script above, all files modified in the last 250 days will be downloaded. If your application uploads 4 files per day, this should cover roughly the latest 1000 files.
The best solution is to redefine your problem: rather than retrieving the N most recent files, retrieve all files from the N most recent days. I think that you'll find this to be a better solution in most cases.
However, to make it work you'll need to adopt some form of date-stamped prefix for the uploaded files. For example, 2021-04-16/myfile.csv.
If you feel that you must retrieve N files, then you can use the prefix to retrieve only a portion of the list. Assuming that you know you have approximately 100 files uploaded per day, start your bucket listing with 2021-04-05/.
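For example, a minimal sketch of listing only the keys from a given date-stamped prefix onward, using the StartAfter parameter of list_objects_v2 (bucket name, prefix layout, and cut-off date are assumptions):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# keys named like '2021-04-16/myfile.csv' sort lexicographically by date,
# so StartAfter skips everything uploaded before the chosen day
pages = paginator.paginate(Bucket='bucket', StartAfter='2021-04-05/')

for page in pages:
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['LastModified'])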

How should i create the index of files using boto and S3 using python django

I have a folder structure with 2000 files on S3.
I want to run a program every week that gets the lists of files and folders from S3 and populates the database.
Then I use that database to show the same folder structure on the site.
I have two problems:
How can I get the list of folders from there and then store them in MySQL? Do I need to grab all the file names and then split on "/"? But it looks difficult to see which files belong to which folders. I have found this https://stackoverflow.com/a/17096755/1958218 but could not find where the listObjects() function is.
Doesn't the get_all_keys() method of the S3 bucket do what you need?
s3 = boto.connect_s3()
b = s3.get_bucket('bucketname')
keys = b.get_all_keys()
Then iterate over the keys, apply os.path.split to each key name, and de-duplicate the resulting folder paths.
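A rough sketch of that last step, using bucket.list() rather than get_all_keys() so the iteration is not capped at 1000 keys (the bucket name is a placeholder):

import os
import boto

s3 = boto.connect_s3()
b = s3.get_bucket('bucketname')

folders = set()
for key in b.list():
    folder, filename = os.path.split(key.name)  # e.g. 'a/b/c.txt' -> ('a/b', 'c.txt')
    if folder:
        folders.add(folder)

# 'folders' now holds each distinct folder path once, ready to store in MySQL
for folder in sorted(folders):
    print(folder)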
