Download specific files from AWS S3 bucket using boto3 - python

I am trying to download specific files from an S3 bucket to a local machine.
The bucket structure is as follows:
BucketName/TT/2019/07/23/files.pdf
I want to download all files under:
BucketName/TT/2019/07/23
How can this be done?

Please try this:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('BucketName')
for obj in bucket.objects.filter(Prefix='TT/2019/07/23/'):
    filename = obj.key.split("/").pop()
    if filename != "":  # skip the "folder" placeholder key, which has an empty filename
        print('Downloading ', obj.key)
        bucket.download_file(obj.key, filename)
Note that you will need to configure AWS credentials first. Please refer to the Boto3 quick start guide to see how to do that.
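For reference, the usual way to do that is aws configure, which writes a credentials file that boto3 reads automatically; the values below are placeholders, and the default region goes into the separate ~/.aws/config file:
# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY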

Related

Cannot Access Subfolder of S3 bucket – Python, Boto3

I have been given access to a subfolder of an S3 bucket, and I want to access all files inside it using Python and boto3. I am new to S3 and have read the docs to death, but haven't been able to figure out how to successfully access just one subfolder. I understand that S3 does not use a unix-like directory structure, but I don't have access to the root of the bucket.
How can I configure boto3 to just connect to this subfolder?
I have successfully used this AWS CLI command to download the entire subfolder to my machine:
aws s3 cp --recursive s3://s3-bucket-name/SUB_FOLDER/ /Local/Path/Where/Files/Download/To --profile my-profile
This code:
AWS_BUCKET='s3-bucket-name'
s3 = boto3.client("s3", region_name='us-east-1', aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET)
response = s3.list_objects(Bucket=AWS_BUCKET)
Returns this error:
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
I have also tried specifying the 'prefix' option in the call to list_objects, but this produces the same error.
You want to run aws configure and save your credentials and region; after that, using boto3 is simple and easy.
Use boto3.resource and get the client like this:
s3_resource = boto3.resource('s3')
s3_client = s3_resource.meta.client
s3_client.list_objects(Bucket=AWS_BUCKET)
You should be good to go.
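If the credentials you were given only allow access under the subfolder, listing the whole bucket can still return AccessDenied; here is a minimal sketch that restricts the call to that prefix (the bucket name and SUB_FOLDER are placeholders from the question):
import boto3

s3_client = boto3.client('s3')
# Only list keys under the prefix you were actually granted access to.
response = s3_client.list_objects_v2(Bucket='s3-bucket-name', Prefix='SUB_FOLDER/')
for obj in response.get('Contents', []):
    print(obj['Key'])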

File Migration from EC2 to S3

We are currently creating a website that is an upgrade to an old existing one. We would like to keep the old posts (which include images) on the new website. The old files are kept on an EC2 instance, while the new website is serverless and keeps all its files in S3.
My question: is there any way I could transfer the old files (from EC2) to the new S3 bucket using Python? I would like to rename and relocate the files according to the new filename/filepath pattern that we devs decided on.
There is boto3, the AWS SDK for Python.
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html
import logging
import boto3
from botocore.exceptions import ClientError

def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """
    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True
You can write a script with the S3 upload_file function above, then run it locally on your EC2 instance.
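For example, a rough sketch of such a script, reusing the upload_file function above and assuming a hypothetical source directory on the EC2 instance and a hypothetical key pattern (adjust both to your own scheme):
import os

SOURCE_DIR = '/var/www/old-site/uploads'  # hypothetical location of the old files on EC2
TARGET_BUCKET = 'new-website-bucket'      # hypothetical destination bucket

for root, _, files in os.walk(SOURCE_DIR):
    for name in files:
        local_path = os.path.join(root, name)
        # Hypothetical new key pattern; replace with the naming scheme your team decided on.
        new_key = 'posts/images/' + name
        upload_file(local_path, TARGET_BUCKET, new_key)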

Set GCP environment variables to access a bucket

I'm working with GCP buckets to store data. My first approach to reading/writing files from/into the buckets was:
from google.cloud import storage

def upload_blob(bucket_name, source_file_name, destination_blob_name, credentials):
    """Uploads a file to the bucket."""
    client = storage.Client.from_service_account_json(credentials)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print('File {} uploaded to {}.'.format(
        source_file_name,
        destination_blob_name))

def download_blob(bucket_name, source_blob_name, destination_file_name, credentials):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client.from_service_account_json(credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name))
This works fine, but I want to use environment variables instead, so I don't have to keep the credentials file path around all the time. As I understand it, if the credentials are not provided explicitly, changing:
client = storage.Client.from_service_account_json(credentials)
for:
client = storage.Client()
Then Google will search for the default credentials, which can be set by doing:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"
Which I'm doing, and I don't get any error.
But when I try to access the bucket, I get the following error:
DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
I followed the link and created a new key, and tried with that one instead, but I still get the same error.
You are getting the error because the interpreter is unable to fetch the credentials it needs to proceed.
client = storage.Client.from_service_account_json(credentials)
On this line, are you passing just the word credentials or the path to the file? Try pointing it at the service account JSON file directly.
Solution: please refer to the code snippet below:
from google.cloud import storage

client = storage.Client.from_service_account_json('serviceaccount.json')
where the 'serviceaccount.json' file is kept in the same project repo.
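If you would rather stick with the environment-variable approach from the question, note that GOOGLE_APPLICATION_CREDENTIALS has to be visible to the Python process itself; an export made in a different shell session, or after the interpreter started (e.g. in a running notebook), will not be picked up. A minimal sketch, with the key path and bucket name as placeholders:
import os
from google.cloud import storage

# Must be set before the client is created; the path is a placeholder.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/user/Downloads/service-account.json'

client = storage.Client()  # now finds the default credentials
bucket = client.get_bucket('my-bucket-name')  # placeholder bucket name
print(bucket.name)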

Python S3 Amazon Code with 'Access Denied' Error

I am trying to download a specific S3 file off a server using Python Boto and am getting "403 Forbidden" and "Access Denied" error messages. It says the error occurs at line 24 (the get_contents command). I have tried it with and without the "aws s3 cp" at the start of the source file path, and received the same error message both times. My code is below; any advice would be helpful.
# Code to append csv:
import csv
import boto
from boto.s3.key import Key
keyId ="key"
sKeyId="secretkey"
srcFileName="aws s3 cp s3://...."
destFileName="C:\\Users...."
bucketName="bucket00001"
conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName, validate = False)
#Get the Key object of the given key, in the bucket
k = Key(bucket, srcFileName)
#Get the contents of the key into a file
k.get_contents_to_filename(destFileName)
AWS is very vague with the errors that it outputs. This is intentional, but it definitely doesn't help with debugging. You are receiving an Access Denied error because the source file name you are using is not the correct path for the file.
aws s3 cp
This is the CLI command to copy one or more files from a source to a destination (either of which can be an S3 bucket). It should not be a part of the source file name.
s3://...
This prefix is prepended to your bucket name to denote that the path refers to an S3 object; however, it is not necessary in your source file path when using boto3.
To download an s3 file using boto3, perform the following:
import boto3
BUCKET_NAME = 'my-bucket' # does not include s3://
KEY = 'image.jpg' # the file you want to download
s3 = boto3.resource('s3')
s3.Bucket(BUCKET_NAME).download_file(KEY, 'image.jpg')
Documentation for this command can be found here:
https://boto3.readthedocs.io/en/latest/guide/s3-example-download-file.html
In general, boto3 (and the other AWS SDKs) are simply wrappers around AWS API requests. You can also use the AWS CLI, as I mentioned earlier, to download a file from S3. That command would be:
aws s3 cp s3://my-bucket/my-file.jpg C:\location\my-file.jpg
srcFileName="aws s3 cp s3://...."
This has to be a key such as somefolder/somekey or somekey, as a string.
You are passing a command and a URI to it instead.
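For example, if the object lived at s3://bucket00001/reports/data.csv (a hypothetical path), the key-related lines of the original boto code would become:
# The key is just the object's path inside the bucket: no "aws s3 cp" and no "s3://bucket" prefix.
srcFileName = "reports/data.csv"  # hypothetical key
k = Key(bucket, srcFileName)
k.get_contents_to_filename(destFileName)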

How to recursively list files in AWS S3 bucket using AWS SDK for Python?

I am trying to replicate the AWS CLI ls command to recursively list files in an AWS S3 bucket. For example, I would use the following command to recursively list all of the files in the "location2" bucket.
aws s3 ls s3://location2 --recursive
What is the AWS SDK for Python (i.e. boto3) equivalent of aws s3 ls s3://location2 --recursive?
You'd need to use paginators:
import boto3

client = boto3.client("s3")
bucket = "my-bucket"
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket=bucket)
for page in page_iterator:
    for obj in page['Contents']:
        print(f"s3://{bucket}/{obj['Key']}")
There is no need for a --recursive option when using the AWS SDK, as the list_objects method lists the objects in the bucket (note that a single call returns at most 1,000 keys, so larger buckets need the pagination shown above).
import boto3
client = boto3.client('s3')
client.list_objects(Bucket='MyBucket')
Using the higher-level resource API is the way to go.
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('location2')
bucket_files = [x.key for x in bucket.objects.all()]
You can also use the minio-py client library; it's open source and compatible with AWS S3.
A list_objects.py example is below; you can refer to the docs for additional information.
from minio import Minio

client = Minio('s3.amazonaws.com',
               access_key='YOUR-ACCESSKEYID',
               secret_key='YOUR-SECRETACCESSKEY')

# List all object paths in bucket that begin with my-prefixname.
objects = client.list_objects('my-bucketname', prefix='my-prefixname',
                              recursive=True)
for obj in objects:
    print(obj.bucket_name, obj.object_name.encode('utf-8'), obj.last_modified,
          obj.etag, obj.size, obj.content_type)
Hope it helps.
Disclaimer: I work for Minio
aws s3 ls s3://logs/access/20230104/14/ --recursive
To list the complete paths of all files under that prefix, with basic error handling:
import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket="logs", Prefix="access/20230104/14/")
for page in pages:
    try:
        for obj in page['Contents']:
            print(obj['Key'])
    except KeyError:
        print("No files exist")
        exit(1)
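If you also want output shaped like the CLI's recursive listing (last-modified, size, key), here is a sketch against the 'location2' bucket from the question:
import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='location2'):
    for obj in page.get('Contents', []):
        # LastModified is a datetime and Size is an int, so they format directly.
        print(f"{obj['LastModified']:%Y-%m-%d %H:%M:%S} {obj['Size']:>10} {obj['Key']}")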
