Find all JSON files within S3 Bucket - python

Is it possible to find all .json files within an S3 bucket when the bucket itself can have multiple sub-directories?
My bucket includes multiple sub-directories, and I would like to collect all the JSON files inside them so I can iterate over them and parse specific key/values.

Here's a solution (it uses the boto3 module):
import boto3

s3 = boto3.client('s3')  # Create the client for your bucket
objs = s3.list_objects_v2(Bucket='my-bucket')['Contents']
files = [obj for obj in objs if obj['Key'].endswith('.json')]  # JSON keys only
The syntax for the list_objects_v2 function in boto3 can be found here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects_v2
Note that only the first 1,000 keys are returned per call. To retrieve more than 1,000 keys and fully exhaust the listing, you can use the Paginator class:
s3 = boto3.client('s3')  # Create the client for your bucket
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='my-bucket')

files = []
for page in pages:
    page_files = [obj for obj in page.get('Contents', []) if obj['Key'].endswith('.json')]  # JSON keys only
    files.extend(page_files)
Note: I recommend using a function that uses yield to lazily iterate over the files instead of returning the whole list, especially if the number of JSON files in your bucket is extremely large (see the sketch below).
Alternatively, you can also use the ContinuationToken parameter (check the boto3 reference linked above).
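For illustration, here is a minimal generator sketch along those lines (assuming the same 'my-bucket' placeholder); it combines the paginator with yield so callers can iterate over keys without holding the whole list in memory:
import boto3

def iter_json_keys(bucket_name):
    """Yield the keys of all .json objects in the bucket, one at a time."""
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name):
        for obj in page.get('Contents', []):  # 'Contents' is absent on empty pages
            if obj['Key'].endswith('.json'):
                yield obj['Key']

for key in iter_json_keys('my-bucket'):
    print(key)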

Related

boto3: How to Find Prefix of S3 Object and Update Object Given the Filename?

I have a populated S3 bucket and a filename I want to search for and update. The bucket has an existing list of objects (with prefixes). I am trying to make an API call to the bucket to update a given object. However, I need to search through all the objects in the bucket (including all prefixes), and then update the one that matches the given filename (via an S3 file upload).
For example, my bucket has the following in root:
cat2.jpg
cat3.jpg
PRE CAT_PICS/cat.jpg
PRE CAT_PICS/cat2.jpg
PRE CAT_PICS/MORE_CAT_PICS/cat3.jpg
If I want to "recursively" search through all objects, match, and update cat3.jpg, how could I do that? Additionally, is there a way to extract the prefix of the found object? I have some code already but I am not sure if it is correct. It also cannot handle more than 1,000 objects since it lacks pagination:
My Current Code:
import boto3

FILE_TO_UPDATE = "cat3.jpg"
s3_bucket = "example_bucket2182023"

s3_client = boto3.client("s3")
s3_resource = boto3.resource("s3")
my_bucket = s3_resource.Bucket(s3_bucket)

for my_bucket_object in my_bucket.objects.all():
    if my_bucket_object.key.endswith(FILE_TO_UPDATE):
        try:
            s3_client.put_object(
                Bucket=s3_bucket,
                Key=my_bucket_object.key
            )
        except Exception:
            print(f"Error uploading {my_bucket_object.key} to {s3_bucket}")
        print(my_bucket_object.key)
For pagination, you can use get_paginator() from the boto3 S3 client. See the reference here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/paginators.html#customizing-page-iterators
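As a rough sketch only (reusing the hypothetical bucket and filename from the question), the paginator can walk every key, and the prefix of a match can be recovered by splitting the key on its last slash; the final upload_file call assumes a local file named cat3.jpg exists:
import boto3

FILE_TO_UPDATE = "cat3.jpg"          # hypothetical filename from the question
s3_bucket = "example_bucket2182023"  # hypothetical bucket name from the question

s3_client = boto3.client("s3")
paginator = s3_client.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=s3_bucket):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith(FILE_TO_UPDATE):
            prefix = key.rsplit("/", 1)[0] + "/" if "/" in key else ""  # extract the prefix, if any
            print(f"Found {key} (prefix: {prefix!r})")
            # Replace the object's content with a local file of the same name
            s3_client.upload_file(FILE_TO_UPDATE, s3_bucket, key)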

get just sub folder name in bucket s3 [duplicate]

I have boto code that collects S3 sub-folders in levelOne folder:
import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket("MyBucket")
for level2 in bucket.list(prefix="levelOne/", delimiter="/"):
    print(level2.name)
Please help me find the equivalent functionality in boto3. The code should not iterate through all S3 objects, because the bucket has a very large number of objects.
If you are simply seeking a list of folders, then use CommonPrefixes returned when listing objects. Note that a Delimiter must be specified to obtain the CommonPrefixes:
import boto3

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='BUCKET-NAME', Delimiter='/')
for prefix in response['CommonPrefixes']:
    print(prefix['Prefix'][:-1])
If your bucket has a HUGE number of folders and objects, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
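To reproduce the original prefix="levelOne/" behaviour, and to stay correct when there are more than 1,000 results, a paginated variant along these lines should work (BUCKET-NAME is a placeholder):
import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

# Only the "folders" directly under levelOne/ are returned as CommonPrefixes
for page in paginator.paginate(Bucket='BUCKET-NAME', Prefix='levelOne/', Delimiter='/'):
    for prefix in page.get('CommonPrefixes', []):
        print(prefix['Prefix'])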
I think the following should be equivalent:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('MyBucket')
for obj in bucket.objects.filter(Prefix="levelOne/", Delimiter="/"):
    print(obj.key)

I am creating a Lambda using Python to create a file in an S3 bucket but it's only creating one row. Need iteration-based creation

I am creating an AWS Lambda function using Python to create a file in an S3 bucket but it is only creating one row. Need iteration based creation.
Below is my code:
import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = 'xyz'
    eventToUpload = {}
    eventToUpload['ITEM_ID'] = 1234
    eventToUpload['Name'] = 'John'
    eventToUpload['Office'] = 'NY'
    fileName = 'testevent' + '.json'
    uploadByteStream = bytes(json.dumps(eventToUpload).encode('UTF-8'))
    s3.put_object(Bucket=bucket, Key=fileName, Body=uploadByteStream)
    print('Put Complete')
The above code creates a JSON file in the S3 bucket, but I want to iterate, say, 5 times so that the file has 5 records.
Amazon S3 does not support appending content to an existing object.
You can either:
Loop through your events in your Python script and create multiple files, each with a uniquely identifying fileName prefix/suffix, say, ITEM_ID; or
Read and parse the existing file beforehand, if any, then append your events to it and write/replace the file (see the sketch below).
This question seems to have useful answers to your case as well.
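Here is a minimal sketch of that second option, assuming the object stores a JSON list and reusing the hypothetical bucket 'xyz' and key 'testevent.json' from the question:
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def append_event(bucket, key, new_event):
    """Read the existing JSON list (if any), append one event, and rewrite the object."""
    try:
        existing = json.loads(s3.get_object(Bucket=bucket, Key=key)['Body'].read())
    except ClientError:
        existing = []  # the object does not exist yet
    existing.append(new_event)
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(existing).encode('UTF-8'))

# Hypothetical usage: write 5 records into the same testevent.json file
for item_id in range(5):
    append_event('xyz', 'testevent.json', {'ITEM_ID': item_id, 'Name': 'John', 'Office': 'NY'})
If all the records are available up front, it is simpler to collect them into a single list inside the loop and call put_object once at the end.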

Is there a way to get latest folder if folders are arranged as yyyy/mm/dd/hh in S3

I have folders in s3 bucket structured as YYYY/MM/DD/HH/file.txt
I am using a Lambda function whose input will be YYYY/MM/DD/HH and the Lambda function will return content from the file.
Let's say these are valid folders (meaning they have file.txt):
2018/12/30/12
2018/12/30/17
2018/12/30/21
If I were to input 2018/12/30/15, I want my Lambda function to print the file from the latest folder before the user-given time, so it would give me the file from 2018/12/30/12.
I tried going back one hour at a time and using s3.get_object() to check whether that file exists.
How can I use list_objects() to achieve this instead, as the above method is not preferable?
I am using Lambda, boto3, python.
First, it's helpful to note that there aren't really subfolders in a bucket. Key names of objects may contain slashes and the S3 console infers some hierarchy.
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders. However, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does. The Amazon S3 console supports a concept of folders.
https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html
That said, you can get the list of all keys and then find the maximum key that is less than your target value.
import json
import boto3

def lambda_handler(event, context):
    bucket_name = "your-bucket-name"

    # The max_key value is a parameter, but it's not clear
    # how the Lambda will be called so I just hard-coded it.
    max_key = "2018/12/30/15"

    s3 = boto3.resource('s3')
    my_bucket = s3.Bucket(bucket_name)

    # Get the list of all items. The resource API pages through results
    # automatically, but this can be slow if the bucket holds a very large
    # number of objects.
    all_items = my_bucket.objects.all()

    # Extract the actual key names into a list
    all_keys = [item.key for item in all_items]

    # Find the key that is the max() value less than the incoming key (max_key)
    golden_key = max([key for key in all_keys if key < max_key])

    result = f'The biggest key less than "{max_key}" is: "{golden_key}"'
    print(result)

    return {
        'statusCode': 200,
        'body': json.dumps(result)
    }
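As a possible refinement (my assumption, not part of the answer above), the listing can be narrowed with a Prefix derived from the target key, for example the day portion, falling back to a wider prefix if nothing is found:
import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket("your-bucket-name")

max_key = "2018/12/30/15"
day_prefix = max_key.rsplit('/', 1)[0] + '/'  # "2018/12/30/"

# Only list keys for that day instead of the whole bucket
day_keys = [item.key for item in my_bucket.objects.filter(Prefix=day_prefix)]
candidates = [key for key in day_keys if key < max_key]
if candidates:
    print(max(candidates))
else:
    print("No earlier key under", day_prefix, "- widen the prefix (e.g. the month) and retry")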

Download latest uploaded file from amazon s3 using boto3 in python

I have a few csv files inside one of my buckets on Amazon S3.
I need to download the latest uploaded csv file.
How can I achieve this using boto3 in Python?
Thanks.
S3 doesn't have an API for listing files ordered by date.
However, if you indeed have only a few, you can list the files in the bucket and order them by last modification time.
import boto3

s3Client = boto3.client('s3')

bucketList = s3Client.list_objects(Bucket='MyBucket')['Contents']  # notice this returns up to 1000 files
orderedList = sorted(bucketList, key=lambda obj: obj['LastModified'])
lastUpdatedKey = orderedList[-1]['Key']
latestObject = s3Client.get_object(Bucket='MyBucket', Key=lastUpdatedKey)
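If the bucket might hold more than 1,000 objects, a hedged variant using the list_objects_v2 paginator and download_file (assuming the placeholder bucket 'MyBucket' and a local /tmp download path) could look like this:
import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

latest = None
for page in paginator.paginate(Bucket='MyBucket'):
    for obj in page.get('Contents', []):
        if latest is None or obj['LastModified'] > latest['LastModified']:
            latest = obj  # keep the most recently modified object seen so far

if latest is not None:
    # Download the newest object to a local path (/tmp here, as an example)
    s3_client.download_file('MyBucket', latest['Key'], '/tmp/' + latest['Key'].split('/')[-1])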
