I have referred to many SO answers for listing files under a folder using boto3 and Python, but was unable to get it working. Below is my code:
s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='maxValue',
                                    Prefix='madl-temp/')
My S3 path is "s3://madl-temp/maxValue/", where I want to find out whether there are any parquet files under the maxValue bucket, based on which I have to do something like below:
if len(maxValue) > 0:
    maxValue = True
else:
    maxValue = False
I am running it via Glue jobs and I am getting the below error:
botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist
Your bucket name is madl-temp and your prefix is maxValue/. But in your boto3 call, you have them the other way around. So it should be:
s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='madl-temp',
                                    Prefix='maxValue/')
To get the number of files, you have to do:
len(object_listing['Contents']) - 1
where the -1 accounts for the prefix maxValue/ itself.
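For the original goal of flagging whether any parquet files exist under that prefix, a minimal sketch building on the corrected call could look like the following; note that 'Contents' is missing from the response when nothing matches the prefix, and the .parquet check is an assumption about how your files are named:

import boto3

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='madl-temp', Prefix='maxValue/')

# 'Contents' is only present when at least one object matches the prefix
parquet_keys = [obj['Key'] for obj in response.get('Contents', [])
                if obj['Key'].endswith('.parquet')]

maxValue = len(parquet_keys) > 0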
I can't seem to figure out how to translate what I can do with the CLI into boto3/Python.
I can run this fine:
aws s3 ls s3://bucket-name-format/folder1/folder2/
aws s3 cp s3://bucket-name-format/folder1/folder2/myfile.csv.gz
Trying to do this with boto3:
import boto3
s3 = boto3.client('s3', region_name='us-east-1', aws_access_key_id=KEY_ID, aws_secret_access_key=ACCESS_KEY)
bucket_name = "bucket-name-format"
bucket_dir = "/folder1/folder2/"
bucket = '{0}{1}'.format(bucket_name,bucket_dir)
filename = 'myfile.csv.gz'
s3.download_file(Filename=final_name,Bucket=bucket,Key=filename)
I get this error:
invalid bucket name "bucket-name-format/folder1/folder2/": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
I know the error is because the bucket name "bucket-name-format/folder1/folder2/" is indeed invalid.
Question: how do I add the path? All the examples I've seen just list the base bucket name.
Take the following command:
aws s3 cp s3://bucket-name-format/folder1/folder2/myfile.csv.gz
That S3 URI can be broken down into
Bucket Name: bucket-name-format
Object Prefix: folder1/folder2/
Object Suffix: myfile.csv.gz
Really, the prefix and suffix are a bit artificial; the object key is actually folder1/folder2/myfile.csv.gz.
This means to download the same object with the boto3 API, you want to call it with something like:
bucket_name = "bucket-name-format"
bucket_dir = "folder1/folder2/"
filename = 'myfile.csv.gz'
final_name = filename  # local path to save the downloaded file as
s3.download_file(Filename=final_name, Bucket=bucket_name, Key=bucket_dir + filename)
Note that the argument to download_file for the Bucket is just the bucket name, and the Key does not start with a forward slash.
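If you regularly start from a full s3:// URI, a small helper can do the split for you. This is just a sketch, and split_s3_uri is an illustrative name rather than part of boto3:

from urllib.parse import urlparse

import boto3

def split_s3_uri(uri):
    # "s3://bucket-name-format/folder1/folder2/myfile.csv.gz"
    # -> ("bucket-name-format", "folder1/folder2/myfile.csv.gz")
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip('/')

s3 = boto3.client('s3')
bucket, key = split_s3_uri("s3://bucket-name-format/folder1/folder2/myfile.csv.gz")
s3.download_file(Bucket=bucket, Key=key, Filename='myfile.csv.gz')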
I am successfully downloading an image file to my local computer from my S3 bucket using the following:
import os
import boto3
import botocore
files = ['images/dog_picture.png']
bucket = 'animals'
s3 = boto3.resource('s3')
for file in files:
    s3.Bucket(bucket).download_file(file, os.path.basename(file))
However, when I try to specify the directory to which the image should be saved on my local machine as is done in the docs:
s3.Bucket(bucket).download_file(file, os.path.basename(file), '/home/user/storage/new_image.png')
I get:
ValueError: Invalid extra_args key '/home/user/storage/new_image.png', must be one of: VersionId, SSECustomerAlgorithm, SSECustomerKey, SSECustomerKeyMD5, RequestPayer
I must be doing something wrong but I'm following the example in the docs. Can someone help me specify a local directory?
Looking at the docs, you're providing an extra parameter. The example there is:
import boto3
s3 = boto3.resource('s3')
s3.Bucket('mybucket').download_file('hello.txt', '/tmp/hello.txt')
Per the docs, hello.txt is the name of the object in the bucket and /tmp/hello.txt is the path on your device, so the correct call would be:
s3.Bucket(bucket).download_file(file, '/home/user/storage/new_image.png')
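If you want to keep the original filename but drop it into a local directory, one way (a sketch reusing the question's files and bucket, and assuming /home/user/storage exists) is to build the destination path with os.path.join:

import os

import boto3

files = ['images/dog_picture.png']
bucket = 'animals'
local_dir = '/home/user/storage'

s3 = boto3.resource('s3')
for file in files:
    # the second argument is the full local path to write to
    s3.Bucket(bucket).download_file(file, os.path.join(local_dir, os.path.basename(file)))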
The code to list the contents of an S3 bucket using boto3 is well known:
self.s3_client = boto3.client(
    u's3',
    aws_access_key_id=config.AWS_ACCESS_KEY_ID,
    aws_secret_access_key=config.AWS_SECRET_ACCESS_KEY,
    region_name=config.region_name,
    config=Config(signature_version='s3v4')
)
versions = self.s3_client.list_objects(Bucket=self.bucket_name, Prefix=self.package_s3_version_key)
However, I need to list contents on S3 using libcloud. I could not find it in the documentation.
If you are just looking for all the contents for a specific bucket:
from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver

cls = get_driver(Provider.S3)
s3 = cls(aws_id, aws_secret)
container = s3.get_container(container_name='name')
objects = s3.list_container_objects(container)
s3.download_object(objects[0], '/path/to/download')
The resulting objects list will contain all the keys in that bucket, with filename, byte size, and metadata. To download, call the download_object method on s3 with the full libcloud Object and your destination file path.
If you'd rather get all objects of all buckets, change get_container to list_containers with no parameters.
Information for all driver methods: https://libcloud.readthedocs.io/en/latest/storage/api.html
Short examples specific to s3: https://libcloud.readthedocs.io/en/latest/storage/drivers/s3.html
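If you only need the keys under a particular "folder", a minimal sketch (filtering client-side on the object name so it doesn't rely on any driver-specific prefix argument; the credentials, container name, and prefix below are placeholders) would be:

from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver

aws_id = 'YOUR_ACCESS_KEY_ID'
aws_secret = 'YOUR_SECRET_ACCESS_KEY'

cls = get_driver(Provider.S3)
s3 = cls(aws_id, aws_secret)
container = s3.get_container(container_name='name')

prefix = 'folder1/folder2/'
# keep only the objects whose key starts with the prefix
matching = [obj for obj in s3.list_container_objects(container)
            if obj.name.startswith(prefix)]

for obj in matching:
    print(obj.name, obj.size)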
How to list S3 bucket Delimiter paths?
Basically I want to list all of the "directories" and/or "sub-directories" in an S3 bucket. I know these don't physically exist. I want all the objects that contain the delimiter, and then to return only the part of the key before the delimiter. Starting under a prefix would be even better, but the bucket level should be enough.
Example S3 Bucket:
root.json
/2018/cats/fluffy.png
/2018/cats/gary.png
/2018/dogs/rover.png
/2018/dogs/jax.png
I would like to then do something like:
s3_client = boto3.client('s3')
s3_client.list_objects(only_show_delimiter_paths=True)
Result
/2018/
/2018/cats/
/2018/dogs/
I don't see any way to do this natively using: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects
I could pull all the object names and do this in my application code but that seems inefficient.
The Amazon S3 page in the boto3 documentation has this example:
List top-level common prefixes in Amazon S3 bucket
This example shows how to list all of the top-level common prefixes in an Amazon S3 bucket:
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
But, it only shows top-level prefixes.
So, here's some code to print all the 'folders':
import boto3
client = boto3.client('s3')
objects = client.list_objects_v2(Bucket='my-bucket')
keys = [o['Key'] for o in objects['Contents']]
folders = {k[:k.rfind('/')+1] for k in keys if k.rfind('/') != -1}
print('\n'.join(folders))
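One caveat: list_objects_v2 returns at most 1,000 keys per call, so for larger buckets the same idea should go through a paginator. A sketch (the bucket name is a placeholder):

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

folders = set()
for page in paginator.paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if '/' in key:
            # keep everything up to and including the last delimiter
            folders.add(key[:key.rfind('/') + 1])

print('\n'.join(sorted(folders)))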
I am trying to retrieve a JSON file from an S3 bucket inside a Glue PySpark script.
I am running this function in the job inside AWS Glue:
def run(spark):
    s3_bucket_path = 's3://bucket/data/file.gz'
    df = spark.read.json(s3_bucket_path)
    df.show()
After this I am getting:
AnalysisException: u'Path does not exist: s3://bucket/data/file.gz;'
I searched for this issue and did not find anything similar enough to infer where the issue is. I think there might be a permissions problem accessing the bucket, but then the error message should be different.
You can try this:
s3 = boto3.client("s3", region_name="us-west-2", aws_access_key_id="
", aws_secret_access_key="")
jsonFile = s3.get_object(Bucket=bucket, Key=key)
jsonObject = json.load(jsonFile["Body"])
where Key is the full path to your file inside the bucket. Then use this jsonObject in spark.read.json(jsonObject).
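As a hedged sketch of how this can fit together in a Glue/PySpark job (the bucket, key, and region are placeholders, and it assumes the object is plain JSON text rather than a gzip-compressed file):

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

s3 = boto3.client("s3", region_name="us-west-2")
obj = s3.get_object(Bucket="bucket", Key="data/file.json")
json_text = obj["Body"].read().decode("utf-8")

# spark.read.json accepts an RDD of JSON strings, so wrap the fetched text in one
df = spark.read.json(spark.sparkContext.parallelize([json_text]))
df.show()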