I have credentials ('aws access key', 'aws secret key', and a path) for a dataset stored on AWS S3. I can access the data by using CyberDuck or FileZilla Pro.
I would like to automate the data fetch stage using Python/Anaconda, which comes with boto3, for this purpose.
I do not have a "bucket" name, just a path in the form /folder1/folder2/folder3, and I could not find a way to access the data through the API without a "bucket name".
Is there a way to access S3 programmatically without a "bucket name", i.e. with a path instead?
Thanks
S3 does not have a typical native directory/folder structure; instead, objects are addressed by keys. If a URL starts with s3://dir_name/folder_name/file_name, then dir_name is nothing but the bucket name. If you are not sure of the bucket name but have S3 access parameters and a path, you can:
List all the available S3 buckets:
import boto3

s3 = boto3.client('s3')
response = s3.list_buckets()
Use the head_object() method for each bucket, with your path as the key; a sketch follows below.
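For example, a minimal sketch of that loop, assuming you know the full key of at least one object under your path (the key below is a placeholder):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
key = 'folder1/folder2/folder3/some_file.csv'  # placeholder: a full object key under your path

for bucket in s3.list_buckets()['Buckets']:
    try:
        s3.head_object(Bucket=bucket['Name'], Key=key)
        print(f"Found it: bucket={bucket['Name']}, key={key}")
    except ClientError:
        pass  # not in this bucket, or no access to it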
I am making calls to an API using the Python requests library, and I am receiving the response as JSON. Currently I am saving the JSON response on my local computer; what I would like to do is load the JSON response directly into an S3 bucket. The reason for loading it into an S3 bucket is that the bucket acts as the source for parsing the JSON response into relational output. I was wondering how I can load a JSON file directly into an S3 bucket without using an access key or secret key?
Most of my research on this topic led to using boto3 in Python. Unfortunately, this library also requires a key and ID. The reason for not using a secret key and ID is that my organization has a separate department that takes care of granting access to the S3 bucket, and that department can only create an IAM role with read and write access. I am curious: what is the common industry practice for loading JSON in your organization?
You can make unsigned requests to S3 through a VPC Endpoint (VPCE), and you don't need any AWS credentials this way.
# https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-privatelink.html
from botocore import UNSIGNED
from botocore.config import Config
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED), endpoint_url="https://bucket.vpce-xxx-xxx.s3.ap-northeast-1.vpce.amazonaws.com")
You can restrict the source IP by setting a security group on the VPC Endpoint to protect your S3 bucket. Note that the owner of S3 objects uploaded via unsigned requests is anonymous, which may cause some side effects. In my case, lifecycle rules cannot be applied to those objects.
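A minimal sketch of uploading a JSON API response through such an unsigned client; the endpoint URL, bucket, key, and API URL are placeholders you would replace with your own:

import json
import boto3
import requests
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client(
    's3',
    config=Config(signature_version=UNSIGNED),
    endpoint_url="https://bucket.vpce-xxx-xxx.s3.ap-northeast-1.vpce.amazonaws.com",  # placeholder VPCE URL
)

resp = requests.get("https://api.example.com/data")  # placeholder API call
s3.put_object(
    Bucket="my-bucket",                # placeholder bucket name
    Key="responses/data.json",         # placeholder key
    Body=json.dumps(resp.json()).encode("utf-8"),
    ContentType="application/json",
)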
I have code that automates CSVs, and it creates a directory using makedirs with a uuid as the directory name. The code works on my local machine but not on S3.
I am using an href to download the CSV file by passing file_path in the context.
views.py
import errno
import os
import uuid

def makedirs(path):
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
    return path

def ..
    tmp_name = str(uuid.uuid4())
    file_path = 'static/tmp/' + tmp_name + '/'
    file_path = makedirs(file_path)
    reviews_df.to_csv(file_path + 'file_processed.csv', index=False)
Thanks a lot!
For Python and S3 you need permission to write to the bucket and you need to know the bucket name. Here, I'll assume you have both.
In the AWS Python SDK, boto3, you have either a client or a resource. Although not all services have both, the most commonly used ones usually do.
# s3 client
import boto3
s3_client = boto3.client('s3')
s3_client.upload_file(local_file_name, bucket_name, key_in_s3)
The client is usually a lower-level aspect of the AWS service and is usually more useful if you are writing some infrastructure code yourself. A resource is usually a higher level of abstraction.
# s3 resource
s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(bucket_name)
bucket.upload_file(local_file_name, key_in_s3)
Also, choosing to use a resource does not preclude you from using a client for the same AWS service. You can do so via the meta attribute on the resource. For example, to perform the client upload shown above but starting from the s3 resource rather than the client, you would do:
# client from resource
s3_resource.meta.client.upload_file(local_file_name, bucket_name, key_in_s3)
but usually you wouldn't do that, because the bucket obtained from the resource already gives you that option.
Even though key_in_s3 doesn't need to be the same as what you have in your local filesystem, it is likely a good idea (for your sanity) to use the same value unless you have requirements otherwise.
Here it sounds like your requirement is to have the S3 key be a uuid. If you plan to download the files at some future point and want to avoid the hassle of choosing an application to open them, you might want to include the .csv extension after the uuid when you upload to S3. It is not a requirement from the S3 side, though.
Q: How to make a directory?
A: The bucket is really the directory. Inside a bucket, there isn't really a concept of a directory in S3. The / characters are used to delineate lists of objects in the AWS console and in the AWS CLI, but that is more of a convenience. If you want that, you can construct it in the code where you build the S3 key; see the sketch below. In other words, the way you have your file path and file name in a single string should work fine.
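A minimal sketch of that idea, assuming a bucket name and reusing the reviews_df from the question: the uuid simply becomes part of the key, and no local makedirs is needed at all.

import io
import uuid
import boto3

s3_client = boto3.client('s3')
bucket_name = 'my-bucket'  # placeholder bucket name

tmp_name = str(uuid.uuid4())
key_in_s3 = f'static/tmp/{tmp_name}/file_processed.csv'  # the slashes are just part of the key

# Write the dataframe to an in-memory buffer and upload it; no local directory is created.
csv_buffer = io.StringIO()
reviews_df.to_csv(csv_buffer, index=False)
s3_client.put_object(Bucket=bucket_name, Key=key_in_s3, Body=csv_buffer.getvalue().encode('utf-8'))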
I am asking for help with listing the objects in my CloudCube bucket. I am developing a Django application hosted on Heroku and am using the CloudCube add-on for persistent storage. CloudCube runs on an AWS S3 bucket, and CloudCube provides a private key/namespace in order to access my files. I use the boto3 library to access the bucket, and everything works fine when I want to upload/download a file; however, I am struggling to list the objects in that particular bucket with the CloudCube prefix key. On every request I receive an AccessDenied exception.
To access the bucket I use the following implementation:
import boto3

s3_client = boto3.client('s3', aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
                         aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
                         endpoint_url=settings.AWS_S3_ENDPOINT_URL, region_name='eu-west-1')
s3_result = s3_client.list_objects(Bucket=settings.AWS_STORAGE_BUCKET_NAME, Prefix=settings.CLOUD_CUBE_KEY)

if 'Contents' not in s3_result:
    return []

file_list = []
for key in s3_result['Contents']:
    if f"{username}/{mode.value}" in key['Key']:
        file_list.append(key['Key'])
As the bucket name, I am using the prefix in the URI that points to the CloudCube bucket on AWS, according to their documentation: https://BUCKETNAME.s3.amazonaws.com/CUBENAME. CUBENAME is then used as the Prefix key.
Does anyone have a clue what I am missing?
Thank you in advance!
According to CloudCube's documentation, you need a trailing slash on the prefix to list the directory.
So you should update your code like this to make it work:
s3_result = s3_client.list_objects(Bucket=settings.AWS_STORAGE_BUCKET_NAME, Prefix=f'{settings.CLOUD_CUBE_KEY}/')
I'm trying to download AWS S3 content using Python/Boto3.
A third party is uploading data, and I need to download it.
They provided credentials like this:
Username : MYUser
aws_access_key_id : SOMEKEY
aws_secret_access_key : SOMEOTHERKEY
Using the popular Windows 10 app CyberDuck, my 'Username' is added to the application's path settings: third-party/MYUser/myfolder.
Nothing I'm given here is my bucket.
my_bucket = s3.Bucket('third-party/MYUser')
ParamValidationError: Parameter validation failed:
Invalid bucket name 'third-party/MYUser': Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$"
my_bucket = s3.Bucket('third-party')
ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
my_bucket = s3.Bucket('MYStuff')
NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjects operation: The specified bucket does not exist
From what I've read, third-party is the AWS S3 bucket name, but I can't find an explanation for how to access a sub-directory of someone else's bucket.
I see Bucket() has some user parameters. I read elsewhere about roles and access control lists, but I'm not finding a simple example.
How do I access someone else's bucket on AWS S3 given Username?
Amazon S3 does not actually have directories. Rather, the Key (filename) of an object contains the full path of the object.
For example, consider this object:
s3://my-bucket/invoices/foo.txt
The bucket is my-bucket
The Key of the object is invoices/foo.txt
So, you could access the object with:
import boto3
s3_resource = boto3.resource('s3')
object = s3_resource.Object('my-bucket', 'invoices/foo.txt')
To keep S3 compatible with systems and humans who expect to have folders and directories, it maintains a list of CommonPrefixes, which are effectively the same as directories. They are derived from the names between slashes (/). So, CyberDuck can give users the ability to navigate through directories.
However, the third party might have only assigned you enough permission to access your own directory, not the root of the bucket. In this case, you will need to go straight to your directory without clicking through the hierarchy.
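For example, a minimal sketch of listing only your own directory, assuming the bucket really is third-party and your permitted prefix is MYUser/ (both guessed from the path CyberDuck shows):

import boto3

s3_client = boto3.client(
    's3',
    aws_access_key_id='SOMEKEY',           # the credentials the third party gave you
    aws_secret_access_key='SOMEOTHERKEY',
)

# List only the keys under your prefix; Delimiter='/' groups deeper "folders"
# into CommonPrefixes, which is how tools like CyberDuck display directories.
response = s3_client.list_objects_v2(Bucket='third-party', Prefix='MYUser/', Delimiter='/')
for obj in response.get('Contents', []):
    print(obj['Key'])
for prefix in response.get('CommonPrefixes', []):
    print(prefix['Prefix'])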
A good way to use an alternate set of credentials is to store them as a separate profile:
aws configure --profile third-party
You will then be prompted for the credentials.
Then, you can use the credentials like this:
aws s3 ls s3://third-party/MyUser --profile third-party
aws s3 cp s3://third-party/MyUser/folder/foo.txt . --profile third-party
The --profile at the end lets you select which credentials to use.
The boto3 equivalent is:
session = boto3.Session(profile_name='third-party')
s3_resource = session.resource('s3')
object = s3_resource.Object('THEIR-bucket', 'MYinvoices/foo.txt')
See: Credentials — Boto 3 Documentation
I need to write code (Python) to copy an S3 file from one S3 bucket to another. The source bucket is in a different AWS account, and we use IAM user credentials to read from that bucket. The code runs in the same account as the destination bucket, so it has write access through an IAM role. One way I can think of is to create an S3 client connection with the source account credentials, read the whole file into memory (get_object?), and then create another S3 client for the destination bucket and write the contents (put_object?) that were previously read into memory. But this gets very inefficient as file sizes grow, so I am wondering if there is a better way, preferably one where boto3 provides an AWS-managed transfer that does not read the contents into memory.
PS: I cannot add or modify roles or policies in the source account to give direct read access to the destination account. The source account is owned by someone else, and they only provide a user that can read from the bucket.
Streaming is the standard solution for this kind of problem. You establish a source and a destination and then you stream from one to the other.
In fact, the boto3 get_object() and upload_fileobj() methods both support streams.
Your code is going to look something like this:
import boto3
src = boto3.client('s3', aws_access_key_id=src_access_key, aws_secret_access_key=src_secret_key)
dst = boto3.client('s3') # creds implicit through IAM role
src_response = src.get_object(Bucket=src_bucket, Key=src_key)
dst.upload_fileobj(src_response['Body'], dst_bucket, dst_key)
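If the files are large, you can also pass a TransferConfig so the upload runs as a multipart transfer in fixed-size parts; a small optional tweak to the sketch above:

from boto3.s3.transfer import TransferConfig

# Stream in 16 MB parts; upload_fileobj reads the source body in chunks,
# so the whole file is never held in memory at once.
transfer_config = TransferConfig(multipart_chunksize=16 * 1024 * 1024)
dst.upload_fileobj(src_response['Body'], dst_bucket, dst_key, Config=transfer_config)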
This is just a suggestion that might provide an updated approach. Most tech articles about how to transfer S3 files from one account to another rely on the destination account to "pull" the files so that the destination account ends up owning the copied files.
However, per this article from AWS, you can now configure buckets with a Bucket owner enforced setting—and in fact this is the default for newly created buckets:
Objects in Amazon S3 are no longer automatically owned by the AWS account that uploads it. By default, any newly created buckets now have the Bucket owner enforced setting enabled.
On the destination bucket, you should be able to grant IAM permission for the source account user to "push" files to that bucket. Then with appropriate S3 commands or API calls, you should be able to copy files directly from the source to the destination without needing to read, buffer, and write data with your Python client.
You might want to test and verify the permissions configuration with the AWS CLI, and then determine how to implement it in Python.
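A minimal sketch of what that "push"-style copy could look like once the destination bucket's policy grants the source-account IAM user s3:PutObject; all bucket and key names are placeholders:

import boto3

# Client using the source-account IAM user's credentials, which can read the
# source bucket and (per the destination bucket policy) write to the destination bucket.
src_session = boto3.Session(
    aws_access_key_id='SRC_ACCESS_KEY',
    aws_secret_access_key='SRC_SECRET_KEY',
)
s3 = src_session.client('s3')

# Managed server-side copy: the data moves within S3 and is not streamed
# through this client.
s3.copy(
    {'Bucket': 'source-bucket', 'Key': 'path/to/file.csv'},
    'destination-bucket',
    'path/to/file.csv',
)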