boto3 download with file path - python

Can't seem to figure out how to translate what I can do with the CLI to boto3 Python.
I can run this fine:
aws s3 ls s3://bucket-name-format/folder1/folder2/
aws s3 cp s3://bucket-name-format/folder1/folder2/myfile.csv.gz
Trying to do this with boto3:
import boto3
s3 = boto3.client('s3', region_name='us-east-1', aws_access_key_id=KEY_ID, aws_secret_access_key=ACCESS_KEY)
bucket_name = "bucket-name-format"
bucket_dir = "/folder1/folder2/"
bucket = '{0}{1}'.format(bucket_name,bucket_dir)
filename = 'myfile.csv.gz'
s3.download_file(Filename=final_name,Bucket=bucket,Key=filename)
I get this error :
invalid bucket name "bucket-name-format/folder1/folder2/": Bucket name must match the regex "^[a-zA-Z0-9.-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9-]{1,63}$"
I know the error is because the bucket name "bucket-name-format/folder1/folder2/" is indeed invalid.
Question: how do I add the path? All the examples I've seen just list the base bucket name.

Take the following command:
aws s3 cp s3://bucket-name-format/folder1/folder2/myfile.csv.gz
That S3 URI can be broken down into
Bucket Name: bucket-name-format
Object Prefix: folder1/folder2/
Object Suffix: myfile.csv.gz
The prefix and suffix are a bit artificial; the object key is really folder1/folder2/myfile.csv.gz
This means to download the same object with the boto3 API, you want to call it with something like:
bucket_name = "bucket-name-format"
bucket_dir = "folder1/folder2/"
filename = 'myfile.csv.gz'
s3.download_file(Filename=final_name,Bucket=bucket_name,Key=bucket_dir + filename)
Note that the argument to download_file for the Bucket is just the bucket name, and the Key does not start with a forward slash.
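If you'd rather derive those two pieces from the URI programmatically instead of splitting it by hand, here is a minimal sketch using urllib.parse; the parse_s3_uri helper is illustrative, not part of boto3:
from urllib.parse import urlparse
import boto3

def parse_s3_uri(uri):
    # s3://bucket-name-format/folder1/folder2/myfile.csv.gz
    # -> ('bucket-name-format', 'folder1/folder2/myfile.csv.gz')
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip('/')

bucket, key = parse_s3_uri("s3://bucket-name-format/folder1/folder2/myfile.csv.gz")
s3 = boto3.client('s3')
s3.download_file(Bucket=bucket, Key=key, Filename='myfile.csv.gz')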

Related

listing s3 buckets using boto3 and python

I am trying to list files under a folder using boto3 and Python. I have referred to many SO answers but have been unable to get it working. Below is my code:
s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='maxValue',
                                    Prefix='madl-temp/')
My S3 path is "s3://madl-temp/maxValue/", where I want to find out whether there are any parquet files under maxValue, based on which I have to do something like the below:
if len(maxValue) > 0:
    maxValue = True
else:
    maxValue = False
I am running it via Glue jobs and I am getting the below error:
botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist
Your bucket name is madl-temp and your prefix is maxValue, but in your boto3 call you have them the other way around. It should be:
s3 = boto3.client('s3')
object_listing = s3.list_objects_v2(Bucket='madl-temp',
                                    Prefix='maxValue/')
To get the number of files you have to do:
len(object_listing['Contents']) - 1
where the -1 accounts for the folder placeholder object maxValue/ itself.
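If the end goal is just to know whether any parquet files exist under that prefix, a rough sketch of the check (my own code, using a paginator so keys beyond the first 1,000 aren't missed) could look like this:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# True if at least one .parquet object exists under the prefix
has_parquet = any(
    obj['Key'].endswith('.parquet')
    for page in paginator.paginate(Bucket='madl-temp', Prefix='maxValue/')
    for obj in page.get('Contents', [])
)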

Mock download file from s3 with actual file

I would like to write a test that mocks the S3 download inside a function and replaces it locally with an actual file that exists on my machine. I took inspiration from this post. The idea is the following:
from moto import mock_s3
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('fake_bucket').download_file(src_f, dest_f)

@mock_s3
def _create_and_mock_bucket():
    # Create fake bucket and mock it
    bucket = "fake_bucket"
    # We need to create the bucket since this is all in Moto's 'virtual' AWS account
    file_path = "some_real_file.txt"
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key=file_path, Body="")
    dl(file_path, 'some_other_real_file.txt')

_create_and_mock_bucket()
Now some_other_real_file.txt exists, but it is not a copy of some_real_file.txt. Any idea on how to do that?
If 'some_real_file.txt' already exists on your system, you should use upload_file instead:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file
For your example:
file_path = "some_real_file.txt"
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket=bucket)
s3_resource = boto3.resource('s3')
s3_resource.meta.client.upload_file(file_path, bucket, file_path)
Your code currently creates an empty file in S3 (since Body=""), and that is exactly what is being downloaded to 'some_other_real_file.txt'.
Notice that if you change the Body parameter to have some text in it, that exact content will be downloaded to 'some_other_real_file.txt'.
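Putting the pieces together, a sketch of the full test could look like the following; it assumes some_real_file.txt exists locally and that moto's mock_s3 decorator is available (older moto versions; newer ones expose mock_aws instead):
from moto import mock_s3
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('fake_bucket').download_file(src_f, dest_f)

@mock_s3
def test_download_gets_real_contents():
    bucket = "fake_bucket"
    file_path = "some_real_file.txt"

    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=bucket)
    # Upload the real local file instead of an empty Body
    s3.upload_file(file_path, bucket, file_path)

    dl(file_path, 'some_other_real_file.txt')

    # The downloaded file now matches the original
    with open(file_path) as f1, open('some_other_real_file.txt') as f2:
        assert f1.read() == f2.read()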

Airflow : Download latest file from S3 with Wildcard

Requirement: to download the latest file, i.e., the current file, from S3.
Sample files in S3:
bucketname/2020/09/reporting_2020_09_20200902000335.zip
bucketname/2020/09/reporting_2020_09_20200901000027.zip
When I pass s3_src_key as /2020/09/reporting_2020_09_20200902, it doesn't work with the code below.
Code:
with tempfile.NamedTemporaryFile('r') as f_source, tempfile.NamedTemporaryFile('w') as f_target:
    s3_client.download_file(self.s3_src_bucket, self.s3_src_key, f_source.name)
The one below works fine:
import os
import boto3

bucket = 'bucketname'
key = '/2020/09/reporting_2020_09_20200902'
s3_resource = boto3.resource('s3')
my_bucket = s3_resource.Bucket(bucket)
objects = my_bucket.objects.filter(Prefix=key)
for obj in objects:
    path, filename = os.path.split(obj.key)
    my_bucket.download_file(obj.key, filename)
I need help with how to use a wildcard in Airflow.
You can list objects that match a given pattern, but then you'll need to write code that decides which one of them is the latest.
The Python SDK call you'll need for that listing is list_objects_v2 (or the bucket.objects.filter(Prefix=...) helper you're already using).
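As a rough sketch (my own code, not Airflow-specific), you could list everything under the date prefix and pick the newest object by LastModified. The bucket and prefix values are taken from the question; note that S3 keys normally do not start with a leading slash:
import boto3

s3 = boto3.client('s3')
bucket = 'bucketname'
prefix = '2020/09/reporting_2020_09_'

response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
objects = response.get('Contents', [])
if objects:
    # Newest object under the prefix
    latest = max(objects, key=lambda obj: obj['LastModified'])
    s3.download_file(bucket, latest['Key'], latest['Key'].split('/')[-1])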

ValueError when downloading a file from an S3 bucket using boto3?

I am successfully downloading an image file to my local computer from my S3 bucket using the following:
import os
import boto3
import botocore
files = ['images/dog_picture.png']
bucket = 'animals'
s3 = boto3.resource('s3')
for file in files:
    s3.Bucket(bucket).download_file(file, os.path.basename(file))
However, when I try to specify the directory to which the image should be saved on my local machine as is done in the docs:
s3.Bucket(bucket).download_file(file, os.path.basename(file), '/home/user/storage/new_image.png')
I get:
ValueError: Invalid extra_args key '/home/user/storage/new_image.png', must be one of: VersionId, SSECustomerAlgorithm, SSECustomerKey, SSECustomerKeyMD5, RequestPayer
I must be doing something wrong but I'm following the example in the docs. Can someone help me specify a local directory?
Looking at the docs, you're providing an extra parameter:
import boto3
s3 = boto3.resource('s3')
s3.Bucket('mybucket').download_file('hello.txt', '/tmp/hello.txt')
From the docs, hello.txt is the name of the object in the bucket and /tmp/hello.txt is the path on your device, so the correct way would be:
s3.Bucket(bucket).download_file(file, '/home/user/storage/new_image.png')
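In other words, the second argument is the full local path, so to save into a directory you can build that path yourself. A small sketch under the question's names:
import os
import boto3

s3 = boto3.resource('s3')
bucket = 'animals'
files = ['images/dog_picture.png']
local_dir = '/home/user/storage'

for file in files:
    # e.g. /home/user/storage/dog_picture.png
    local_path = os.path.join(local_dir, os.path.basename(file))
    s3.Bucket(bucket).download_file(file, local_path)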

How to list S3 bucket Delimiter paths

Basically I want to list all of the "directories" and/or "sub-directories" in an S3 bucket. I know these don't physically exist. I want all the objects that contain the delimiter, and then to return only the portion of each key path before the delimiter. Starting under a prefix would be even better, but at the bucket level should be enough.
Example S3 Bucket:
root.json
/2018/cats/fluffy.png
/2018/cats/gary.png
/2018/dogs/rover.png
/2018/dogs/jax.png
I would like to then do something like:
s3_client = boto3.client('s3')
s3_client.list_objects(only_show_delimiter_paths=True)
Result
/2018/
/2018/cats/
/2018/dogs/
I don't see any way to do this natively using: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects
I could pull all the object names and do this in my application code but that seems inefficient.
The Amazon S3 page in the boto3 documentation has this example:
List top-level common prefixes in Amazon S3 bucket
This example shows how to list all of the top-level common prefixes in an Amazon S3 bucket:
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
But it only shows top-level prefixes.
So, here's some code to print all the 'folders':
import boto3

client = boto3.client('s3')

# Note: list_objects_v2 returns at most 1000 keys per call, so use a
# paginator for larger buckets.
objects = client.list_objects_v2(Bucket='my-bucket')

# Keep the portion of each key up to and including the last '/'
keys = [o['Key'] for o in objects['Contents']]
folders = {k[:k.rfind('/')+1] for k in keys if k.rfind('/') != -1}
print('\n'.join(folders))
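If you want S3 itself to return the "folders" at every level (including intermediate levels like /2018/ that hold no objects directly), a sketch of a recursive walk over CommonPrefixes could look like this; it is my own approach, not from the answer above:
import boto3

client = boto3.client('s3')

def list_all_prefixes(bucket, prefix=''):
    # Ask S3 for the common prefixes at this level, then recurse into each one
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        for cp in page.get('CommonPrefixes', []):
            yield cp['Prefix']
            yield from list_all_prefixes(bucket, cp['Prefix'])

for folder in list_all_prefixes('my-bucket'):
    print(folder)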
