How to list S3 bucket Delimiter paths?
Basically I want to list all of the "directories" and/or "sub-directories" in an S3 bucket. I know these don't physically exist. Essentially, I want all of the objects whose keys contain the delimiter, and then to return only the portion of the key path before the delimiter. Starting under a prefix would be even better, but at the bucket level should be enough.
Example S3 Bucket:
root.json
/2018/cats/fluffy.png
/2018/cats/gary.png
/2018/dogs/rover.png
/2018/dogs/jax.png
I would like to then do something like:
s3_client = boto3.client('s3')
s3_client.list_objects(only_show_delimiter_paths=True)
Result
/2018/
/2018/cats/
/2018/dogs/
I don't see any way to do this natively using: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects
I could pull all the object names and do this in my application code but that seems inefficient.
The Amazon S3 page in boto3 has this example:
List top-level common prefixes in Amazon S3 bucket
This example shows how to list all of the top-level common prefixes in an Amazon S3 bucket:
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
But it only shows top-level prefixes.
So, here's some code to print all the 'folders':
import boto3
client = boto3.client('s3')
objects = client.list_objects_v2(Bucket='my-bucket')
keys = [o['Key'] for o in objects['Contents']]
folders = {k[:k.rfind('/')+1] for k in keys if k.rfind('/') != -1}
print('\n'.join(folders))
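Note that a single list_objects_v2 call returns at most 1,000 keys, so for larger buckets the same idea needs pagination. A minimal sketch of that, assuming the same 'my-bucket' name:
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
# Collect every 'folder' prefix across all pages of results
folders = set()
for page in paginator.paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if '/' in key:
            folders.add(key[:key.rfind('/') + 1])
print('\n'.join(sorted(folders)))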
Related
I am using the code below, and have referred to many SO answers for listing files under a folder using boto3 and Python, but was unable to do so. Here is my code:
s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='maxValue',
                                    Prefix='madl-temp/')
My S3 path is "s3://madl-temp/maxValue/", where I want to find whether there are any parquet files under the maxValue bucket, based on which I have to do something like the following:
if len(maxValue) > 0:
    maxValue = True
else:
    maxValue = False
I am running it via Glue jobs and I am getting the below error:
botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist
Your bucket name is madl-temp and prefix is maxValue. But in boto3, you have the opposite. So it should be:
s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='madl-temp',
                                    Prefix='maxValue/')
To get the number of files you have to do:
len(object_listing['Contents']) - 1
where the -1 accounts for the prefix key maxValue/ itself.
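To answer the original question of whether any parquet files exist under that prefix, a minimal sketch (assuming the same bucket and prefix, and that the file names end in .parquet) could be:
import boto3
s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='madl-temp', Prefix='maxValue/')
# 'Contents' is missing entirely when no keys match the prefix
maxValue = any(obj['Key'].endswith('.parquet')
               for obj in object_listing.get('Contents', []))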
I can't seem to figure out how to translate what I can do with the CLI into boto3 Python.
I can run this fine:
aws s3 ls s3://bucket-name-format/folder1/folder2/
aws s3 cp s3://bucket-name-format/folder1/folder2/myfile.csv.gz
Trying to do this with boto3:
import boto3
s3 = boto3.client('s3', region_name='us-east-1', aws_access_key_id=KEY_ID, aws_secret_access_key=ACCESS_KEY)
bucket_name = "bucket-name-format"
bucket_dir = "/folder1/folder2/"
bucket = '{0}{1}'.format(bucket_name,bucket_dir)
filename = 'myfile.csv.gz'
s3.download_file(Filename=final_name,Bucket=bucket,Key=filename)
I get this error :
Invalid bucket name "bucket-name-format/folder1/folder2/": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
I know the error is because the bucket name "bucket-name-format/folder1/folder2/" is indeed invalid.
Question: how do I add the path? All the examples I've seen just list the base bucket name.
Take the following command:
aws s3 cp s3://bucket-name-format/folder1/folder2/myfile.csv.gz
That S3 URI can be broken down into
Bucket Name: bucket-name-format
Object Prefix: folder1/folder2/
Object Suffix: myfile.csv.gz
Really, the prefix and suffix are a bit artificial; the object name is actually folder1/folder2/myfile.csv.gz.
This means to download the same object with the boto3 API, you want to call it with something like:
bucket_name = "bucket-name-format"
bucket_dir = "folder1/folder2/"
filename = 'myfile.csv.gz'
final_name = filename  # local path to save the download to
s3.download_file(Filename=final_name, Bucket=bucket_name, Key=bucket_dir + filename)
Note that the argument to download_file for the Bucket is just the bucket name, and the Key does not start with a forward slash.
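If you need to split an S3 URI like this programmatically rather than by hand, one possible approach (not part of the original answer) is to use urllib:
from urllib.parse import urlparse
uri = 's3://bucket-name-format/folder1/folder2/myfile.csv.gz'
parsed = urlparse(uri)
bucket_name = parsed.netloc    # 'bucket-name-format'
key = parsed.path.lstrip('/')  # 'folder1/folder2/myfile.csv.gz'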
The code to list contents in S3 using boto3 is known:
self.s3_client = boto3.client(
    u's3',
    aws_access_key_id=config.AWS_ACCESS_KEY_ID,
    aws_secret_access_key=config.AWS_SECRET_ACCESS_KEY,
    region_name=config.region_name,
    config=Config(signature_version='s3v4')
)
versions = self.s3_client.list_objects(Bucket=self.bucket_name, Prefix=self.package_s3_version_key)
However, I need to list contents on S3 using libcloud. I could not find it in the documentation.
If you are just looking for all the contents for a specific bucket:
from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver
client = get_driver(Provider.S3)
s3 = client(aws_id, aws_secret)
container = s3.get_container(container_name='name')
objects = s3.list_container_objects(container)
s3.download_object(objects[0], '/path/to/download')
The resulting objects will be a list of all the keys in that bucket, each with its filename, byte size, and metadata. To download, call the download_object method on s3 with the full libcloud Object and your file path.
If you'd rather get all objects of all buckets, change get_container to list_containers with no parameters.
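A rough sketch of that variant, reusing the driver setup from above:
# Walk every container (bucket) visible to these credentials
for container in s3.list_containers():
    for obj in s3.list_container_objects(container):
        print(container.name, obj.name, obj.size)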
Information for all driver methods: https://libcloud.readthedocs.io/en/latest/storage/api.html
Short examples specific to s3: https://libcloud.readthedocs.io/en/latest/storage/drivers/s3.html
I have a directory 'test' in my S3 bucket, and I want to delete this directory.
This is what I'm doing:
s3 = boto3.resource('s3')
s3.Object(S3Bucket,'test').delete()
and I am getting a response like this:
{'ResponseMetadata': {'HTTPStatusCode': 204, 'HostId':
'************', 'RequestId': '**********'}}
but my directory is not getting deleted!
I tried all combinations of '/test', 'test/' and '/test/', etc., both with a file inside the directory and with an empty directory, and all of them failed to delete 'test'.
delete_objects enables you to delete multiple objects from a bucket using a single HTTP request. You may specify up to 1000 keys.
https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Bucket.delete_objects
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
objects_to_delete = []
for obj in bucket.objects.filter(Prefix='test/'):
    objects_to_delete.append({'Key': obj.key})

bucket.delete_objects(
    Delete={
        'Objects': objects_to_delete
    }
)
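Since delete_objects accepts at most 1,000 keys per request, a prefix holding more objects than that has to be deleted in batches. A sketch of the same idea with batching:
batch = []
for obj in bucket.objects.filter(Prefix='test/'):
    batch.append({'Key': obj.key})
    if len(batch) == 1000:
        bucket.delete_objects(Delete={'Objects': batch})
        batch = []
if batch:
    bucket.delete_objects(Delete={'Objects': batch})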
NOTE: See Daniel Levinson's answer for a more efficient way of deleting multiple objects.
In S3, there are no directories, only keys. If a key name contains a / such as prefix/my-key.txt, then the AWS console groups all the keys that share this prefix together for convenience.
To delete a "directory", you would have to find all the keys whose names start with the directory name and delete each one individually. Fortunately, boto3 provides a filter function to return only the keys that start with a certain string. So you can do something like this:
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')
for obj in bucket.objects.filter(Prefix='test/'):
    s3.Object(bucket.name, obj.key).delete()
I have an S3 server with millions of files under each bucket. I want to download files from a bucket, but only the files that meet a particular condition.
Is there a better way than listing the entire bucket and then checking the condition while iterating over the files?
As can be seen here:
import os
# Import the SDK
import boto
from boto.s3.connection import OrdinaryCallingFormat
LOCAL_PATH = 'W:/RD/Fancy/s3_opportunities/'
bucket_name = '/recording'#/sampledResponseLogger'
# connect to the bucket
print 'Connecting...'
conn = boto.connect_s3(calling_format=OrdinaryCallingFormat()) #conn = boto.connect_s3()
print 'Getting bucket...'
bucket = conn.get_bucket(bucket_name)
print 'Going through the list of files...'
bucket_list = bucket.list()
for l in bucket_list:
    keyString = str(l.key)
    # SOME CONDITION
    if ('2015-08' in keyString):
        # check if file exists locally, if not: download it
        filename = LOCAL_PATH + keyString[56:]
        if not os.path.exists(filename):
            print 'Downloading file: ' + keyString + '...'
            # Download the object that the key represents
            l.get_contents_to_filename(filename)
The only mechanism available for filtering ListBucket operations on the server side is the prefix. So, if your objects in S3 have some sort of an implied directory structure (e.g. foo/bar/fie/baz/object1) then you can use the prefix to list only the objects that start with, for example, foo/bar/fie. If your object names do not display this hierarchical naming, there really isn't anything you can do except list all of the objects and filter using your own mechanism.
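For example, with boto3 (a sketch, assuming a bucket named 'recording' and the same '2015-08' condition used above):
import boto3
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
# The Prefix is applied server-side; any further condition still has to
# be checked client-side after the keys come back.
for page in paginator.paginate(Bucket='recording', Prefix='foo/bar/fie/'):
    for obj in page.get('Contents', []):
        if '2015-08' in obj['Key']:
            print(obj['Key'])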