Why are Amazon S3 clients location dependent? - python

I am playing with amazon-s3. My use case is to just list keys starting with a prefix.
import boto3
s3 = boto3.client('s3', "eu-west-1")
response = s3.list_objects_v2(Bucket="my-bucket", MaxKeys=1, Prefix="my/prefix/")
for content in response['Contents']:
    print(content['Key'])
my-bucket is located in us-east-1. I am surprised that my Python client pointed at eu-west-1 is able to make such requests. For reference, my Scala client:
val client = AmazonS3ClientBuilder.standard().build()
client.listObjectsV2("my-bucket", "my-prefix")
which gives an error
com.amazonaws.services.s3.model.AmazonS3Exception: The bucket is in this region: us-east-1. Please use this region to retry the request
which is expected.
My question is: why is the S3 client location dependent? Is there any advantage to choosing the right location? Is there any hidden cost to not matching the location?

My question is: why is the S3 client location dependent?
Because buckets are regional resources, even if they pretend not to be. Although an S3 bucket is globally accessible, the underlying resources are still hosted in a specific AWS region. If you're using the AWS client SDKs to access those resources, you need to connect to the bucket's regional S3 endpoint.
Is there any advantage to choosing the right location?
Lower latency. If your services are in eu-west-1, it makes sense to have your buckets there too. You also will not pay cross-region data transfer rates, but rather AWS's intra-region rate.
Is there any hidden cost to not matching the location?
Yes. Costs for data egress vary based on region, and you pay more to send data from one region to another than you will to send data between services in the same region.
As to why the boto3 library is not raising an error: it is possibly interrogating the S3 API under the hood to establish where the bucket is located before issuing the list_objects_v2 call.
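For example, one way to resolve a bucket's region explicitly and build a region-correct client (the bucket name is a placeholder; note that get_bucket_location returns None for us-east-1):
import boto3

s3 = boto3.client('s3')
location = s3.get_bucket_location(Bucket='my-bucket')['LocationConstraint'] or 'us-east-1'

# Recreate the client against the bucket's own region before listing
regional_s3 = boto3.client('s3', region_name=location)
response = regional_s3.list_objects_v2(Bucket='my-bucket', MaxKeys=1, Prefix='my/prefix/')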

Related

How to load JSON file from API call to S3 bucket without using Secret Access Key and ID? Looking for common industry practice

I am making calls to an API using the Python requests library, and I am receiving the response in JSON. Currently I am saving the JSON response on my local computer; what I would like to do is load the JSON response directly to an S3 bucket. The reason for loading it to an S3 bucket is that the bucket acts as the source for parsing the JSON response into relational output. I was wondering how I can load the JSON file directly to an S3 bucket without using an access key or secret key?
Most of my research on this topic led to using boto3 in Python. Unfortunately, this library also requires a key and ID. The reason for not using a secret key and ID is that my organization has a separate department which takes care of giving access to the S3 bucket, and that department can only create an IAM role with read and write access. I am curious what the common industry practice is for loading JSON in your organization?
You can make unsigned requests to S3 through a VPC Endpoint (VPCE), and you don't need any AWS credentials this way.
# https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-privatelink.html
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED), endpoint_url="https://bucket.vpce-xxx-xxx.s3.ap-northeast-1.vpce.amazonaws.com")
You can restrict the source IP by setting a security group on the VPC Endpoint to protect your S3 bucket. Note that the owner of S3 objects uploaded via unsigned requests is anonymous, which may cause some side effects. In my case, lifecycle rules could not be applied to those objects.
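A minimal sketch of the original use case, uploading the API's JSON response straight to S3 through such an unsigned VPCE client (the endpoint, bucket, and key below are placeholders):
import json
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client(
    's3',
    config=Config(signature_version=UNSIGNED),
    endpoint_url='https://bucket.vpce-xxx-xxx.s3.ap-northeast-1.vpce.amazonaws.com',
)

api_response = {'example': 'payload'}  # stand-in for the real API response
s3.put_object(
    Bucket='my-bucket',
    Key='responses/response.json',
    Body=json.dumps(api_response).encode('utf-8'),
    ContentType='application/json',
)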

Move files from one s3 bucket to another in AWS using AWS lambda

I am trying to move files older than an hour from one S3 bucket to another S3 bucket using a Python boto3 AWS Lambda function, with the following cases:
Both buckets can be in same account and different region.
Both buckets can be in different account and different region.
Both buckets can be in different account and same region.
I got some help to move files using the Python code mentioned by @John Rotenstein:
import boto3
from datetime import datetime, timedelta

SOURCE_BUCKET = 'bucket-a'
DESTINATION_BUCKET = 'bucket-b'

s3_client = boto3.client('s3')

# Create a reusable Paginator
paginator = s3_client.get_paginator('list_objects_v2')

# Create a PageIterator from the Paginator
page_iterator = paginator.paginate(Bucket=SOURCE_BUCKET)

# Loop through each object, looking for ones older than a given time period
for page in page_iterator:
    for object in page['Contents']:
        if object['LastModified'] < datetime.now().astimezone() - timedelta(hours=1):  # <-- Change time period here
            print(f"Moving {object['Key']}")

            # Copy object
            s3_client.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=object['Key'],
                CopySource={'Bucket': SOURCE_BUCKET, 'Key': object['Key']}
            )

            # Delete original object
            s3_client.delete_object(Bucket=SOURCE_BUCKET, Key=object['Key'])
How can this be modified to cater to these requirements?
An alternate approach would be to use Amazon S3 Replication, which can replicate bucket contents:
Within the same region, or between regions
Within the same AWS Account, or between different Accounts
Replication is frequently used when organizations need another copy of their data in a different region, or simply for backup purposes. For example, critical company information can be replicated to another AWS Account that is not accessible to normal users. This way, if some data was deleted, there is another copy of it elsewhere.
Replication requires versioning to be activated on both the source and destination buckets. If you require encryption, use standard Amazon S3 encryption options. The data will also be encrypted during transit.
You configure a source bucket and a destination bucket, then specify which objects to replicate by providing a prefix or a tag. Objects will only be replicated once Replication is activated. Existing objects will not be copied. Deletion is intentionally not replicated to avoid malicious actions. See: What Does Amazon S3 Replicate?
There is no "additional" cost for S3 replication, but you will still be charged for Data Transfer when moving objects between regions, and for API requests (which are tiny charges), plus storage of course.
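For example, a minimal sketch of turning replication on with boto3, assuming your credentials can administer both buckets and the replication IAM role already exists (all bucket names and ARNs below are placeholders):
import boto3

s3 = boto3.client('s3')

# Versioning must be enabled on both buckets before replication can be configured
for bucket in ('source-bucket', 'destination-bucket'):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={'Status': 'Enabled'},
    )

# Replicate everything under a prefix from source-bucket to destination-bucket
s3.put_bucket_replication(
    Bucket='source-bucket',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::111111111111:role/s3-replication-role',
        'Rules': [{
            'ID': 'replicate-my-prefix',
            'Prefix': 'my/prefix/',
            'Status': 'Enabled',
            'Destination': {'Bucket': 'arn:aws:s3:::destination-bucket'},
        }],
    },
)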
Moving between regions
This is a non-issue. You can just copy the object between buckets and Amazon S3 will figure it out.
Moving between accounts
This is a bit harder, because the code will use a single set of credentials that must have ListBucket and GetObject access on the source bucket, plus PutObject rights on the destination bucket.
Also, if credentials are being used from the Source account, then the copy must be performed with ACL='bucket-owner-full-control' otherwise the Destination account won't have access rights to the object. This is not required when the copy is being performed with credentials from the Destination account.
Let's say that the Lambda code is running in Account-A and is copying an object to Account-B. An IAM Role (Role-A) is assigned to the Lambda function. It's pretty easy to give Role-A access to the buckets in Account-A. However, the Lambda function will need permissions to PutObject in the bucket (Bucket-B) in Account-B. Therefore, you'll need to add a bucket policy to Bucket-B that allows Role-A to PutObject into the bucket. This way, Role-A has permission to read from Bucket-A and write to Bucket-B.
So, putting it all together:
Create an IAM Role (Role-A) for the Lambda function
Give the role Read/Write access as necessary for buckets in the same account
For buckets in other accounts, add a Bucket Policy that grants the necessary access permissions to the IAM Role (Role-A)
In the copy_object() command, include ACL='bucket-owner-full-control' (this is the only coding change needed; see the sketch below)
Don't worry about doing anything special for cross-region; it should just work automatically
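A minimal sketch of that copy call, assuming Role-A can read Bucket-A and Bucket-B's bucket policy grants Role-A PutObject (bucket and key names are placeholders):
import boto3

s3_client = boto3.client('s3')

# Copy from Bucket-A (same account as Role-A) to Bucket-B (other account)
s3_client.copy_object(
    Bucket='bucket-b',
    Key='path/to/object',
    CopySource={'Bucket': 'bucket-a', 'Key': 'path/to/object'},
    ACL='bucket-owner-full-control',  # lets the Bucket-B owner access the copied object
)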

How to transfer a file from one S3 bucket to other with two different users

I need to write code (Python) to copy an S3 file from one S3 bucket to another. The source bucket is in a different AWS account, and we are using IAM user credentials to read from that bucket. The code runs in the same account as the destination bucket, so it has write access through an IAM role. One way I can think of is to create an S3 client connection with the source account, read the whole file into memory (get_object?), and then create another S3 client for the destination bucket and write the contents (put_object?) that were previously read into memory. But this can get very inefficient as the file size grows, so I am wondering if there is a better way, preferably one where boto3 provides an AWS-managed way to transfer the file without reading its contents into memory.
PS: I cannot add or modify roles or policies in the source account to give direct read access to the destination account. The source account is owned by someone else and they only provide a user that can read from the bucket.
Streaming is the standard solution for this kind of problem. You establish a source and a destination and then you stream from one to the other.
In fact, the boto3 get_object() and upload_fileobj() methods both support streams.
Your code is going to look something like this:
import boto3

src = boto3.client(
    's3',
    aws_access_key_id=src_access_key,
    aws_secret_access_key=src_secret_key,
)
dst = boto3.client('s3')  # creds implicit through IAM role

src_response = src.get_object(Bucket=src_bucket, Key=src_key)
dst.upload_fileobj(src_response['Body'], dst_bucket, dst_key)
This is just a suggestion that might provide an updated approach. Most tech articles about how to transfer S3 files from one account to another rely on the destination account to "pull" the files so that the destination account ends up owning the copied files.
However, per this article from AWS, you can now configure buckets with a Bucket owner enforced setting—and in fact this is the default for newly created buckets:
Objects in Amazon S3 are no longer automatically owned by the AWS account that uploads it. By default, any newly created buckets now have the Bucket owner enforced setting enabled.
On the destination bucket, you should be able to grant IAM permission for the source account user to "push" files to that bucket. Then with appropriate S3 commands or API calls, you should be able to copy files directly from the source to the destination without needing to read, buffer, and write data with your Python client.
You might want to test and verify the permissions configuration with the AWS CLI, and then determine how to implement it in Python.
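As a rough sketch of that "push" approach with boto3, assuming the destination bucket uses the Bucket owner enforced setting and its bucket policy grants the source-account IAM user s3:PutObject (all names and credentials below are placeholders):
import boto3

# Client authenticated as the source-account IAM user
src_session = boto3.Session(
    aws_access_key_id='SOURCE_USER_ACCESS_KEY',
    aws_secret_access_key='SOURCE_USER_SECRET_KEY',
)
s3 = src_session.client('s3')

# Server-side copy: S3 moves the bytes internally, nothing is buffered in Python
s3.copy(
    CopySource={'Bucket': 'source-bucket', 'Key': 'data/file.bin'},
    Bucket='destination-bucket',
    Key='data/file.bin',
)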

Parallel copy of buckets/keys from boto3 or the boto API between 2 different accounts/connections

I want to copy keys between buckets in 2 different accounts using the boto3 API.
In boto3, I executed the following code and the copy worked:
source = boto3.client('s3')
destination = boto3.client('s3')
response = source.get_object(Bucket='bucket', Key='key')
destination.put_object(Bucket='bucket', Key='key', Body=response['Body'].read())
Basically I am fetching data with GET and pushing it with PUT to the other account.
Along similar lines with the boto API, I have done the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)

destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really slow: it takes around 15-20 seconds to copy 1 GB, and I have to copy 100 GB plus.
I tried Python multithreading, where each thread performs a copy operation. The performance was bad, as it took 30 seconds to copy 1 GB. I suspect the GIL might be the issue here.
I also tried multiprocessing and I get the same result as a single process, i.e. 15-20 seconds for a 1 GB file.
I am using a very high-end server with 48 cores and 128 GB RAM. The network speed in my environment is 10 Gbps.
Most of the search results talk about copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend calls to do the work. Your approach is wasting resources.
boto3.client.copy will do the job better than this.
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your own needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copying job without a server (see the sketch below).
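For instance, a hedged sketch of that serverless route: a Lambda function triggered by S3 ObjectCreated events that performs a server-side copy into a (placeholder) destination bucket, assuming its execution role can read the source and write the destination:
import urllib.parse
import boto3

s3 = boto3.client('s3')
DESTINATION_BUCKET = 'destination-bucket'  # placeholder

def lambda_handler(event, context):
    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # Server-side copy: no object data passes through the Lambda function
        s3.copy_object(
            Bucket=DESTINATION_BUCKET,
            Key=key,
            CopySource={'Bucket': src_bucket, 'Key': key},
        )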
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store for everyone; that's why bucket names MUST be unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replicating, copying, and moving files in the backend, without involving network traffic.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is much like a file server. Users can copy files to each other without "download/upload"; instead, one just creates a folder with write permission for all, and copying a file from another user is done entirely within the file server, with the fastest raw disk I/O performance. You don't need a powerful instance or a high-performance network when using the backend S3 copy API.
Your method is similar to attempting to download a file from another user over FTP when both of you are on the same file server, which creates unwanted network traffic.
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")
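For bucket-to-bucket copies of large objects, the same transfer manager can also parallelize a server-side multipart copy. A rough sketch, assuming one set of credentials can read the source bucket and write the destination bucket (names and tuning values are placeholders):
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # use multipart copy above 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=20,                    # parallel part copies
)

s3.copy(
    CopySource={'Bucket': 'source-bucket', 'Key': 'big-object.bin'},
    Bucket='destination-bucket',
    Key='big-object.bin',
    Config=config,
)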

Given an archive_id, how might I go about moving an archive from AWS Glacier to an S3 Bucket?

I have written an archival system with Python Boto that tars several directories of files and uploads them to Glacier. This is all working great and I am storing all of the archive IDs.
I wanted to test downloading a large archive (about 120 GB). I initiated the retrieval, but the download took more than 24 hours, and at the end I got a 403 since the resource was no longer available and the download failed.
If I archived straight from my server to Glacier (skipping S3), is it possible to initiate a restore that restores an archive to an S3 bucket so I can take longer than 24 hours to download a copy? I didn't see anything in either the S3 or Glacier Boto docs.
Ideally I'd do this with Boto but would be open to other scriptable options. Does anyone know how, given an archiveId, I might go about moving an archive from AWS Glacier to an S3 bucket? If this is not possible, are there other options to give myself more time to download large files?
Thanks!
http://docs.pythonboto.org/en/latest/ref/glacier.html
http://docs.pythonboto.org/en/latest/ref/s3.html
The direct Glacier API and the S3/Glacier integration are not connected to each other in a way that is accessible to AWS users.
If you upload directly to Glacier, the only way to get the data back is to fetch it back directly from Glacier.
Conversely, if you add content to Glacier via S3 lifecycle policies, then there is no exposed Glacier archive ID, and the only way to get the content is to do an S3 restore.
It's essentially as if "you" aren't the Glacier customer, but rather "S3" is the Glacier customer, when you use the Glacier/S3 integration. (In fact, that's a pretty good mental model -- the Glacier storage charges are even billed differently -- files stored through the S3 integration are billed together with the other S3 charges on the monthly invoice, not with the Glacier charges).
The way to accomplish what you are directly trying to accomplish is to do range retrievals, where you only request that Glacier restore a portion of the archive.
Another reason you could choose to perform a range retrieval is to manage how much data you download from Amazon Glacier in a given period. When data is retrieved from Amazon Glacier, a retrieval job is first initiated, which will typically complete in 3-5 hours. The data retrieved is then available for download for 24 hours. You could therefore retrieve an archive in parts in order to manage the schedule of your downloads. You may also choose to perform range retrievals in order to reduce or eliminate your retrieval fees.
— http://aws.amazon.com/glacier/faqs/
You'd then need to reassemble the pieces. That last part seems like a big advantage also, since Glacier does charge more, the more data you "restore" at a time. Note this isn't a charge for downloading the data, it's a charge for the restore operation, whether you download it or not.
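A rough sketch of a range retrieval using boto3 (the question uses the older boto library, but the flow is the same; the vault name and archive ID are placeholders, and ranges must be megabyte-aligned):
import boto3

glacier = boto3.client('glacier')
ONE_GIB = 1024 * 1024 * 1024

# Ask Glacier to restore only the first 1 GiB of the archive
job = glacier.initiate_job(
    accountId='-',  # '-' means the account that owns the credentials
    vaultName='my-vault',
    jobParameters={
        'Type': 'archive-retrieval',
        'ArchiveId': 'YOUR_ARCHIVE_ID',
        'RetrievalByteRange': f'0-{ONE_GIB - 1}',
    },
)

# Poll until the job completes (typically several hours), then download that piece.
# Repeat with the next byte range and reassemble the parts locally.
status = glacier.describe_job(accountId='-', vaultName='my-vault', jobId=job['jobId'])
if status['Completed']:
    output = glacier.get_job_output(accountId='-', vaultName='my-vault', jobId=job['jobId'])
    with open('archive.part0', 'wb') as f:
        f.write(output['body'].read())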
One advantage I see of the S3 integration is that you can leave your data "cooling off" in S3 for a few hours/days/weeks before you put it "on ice" in Glacier, which happens automatically... so you can fetch it back from S3 without paying a retrieval charge, until it's been sitting in S3 for the amount of time you've specified, after which it automatically migrates. The potential downside is that it seems to introduce more moving parts.
Using lifecycle policies you can move files directly from S3 to Glacier, and you can also restore those objects back to S3 using the restore method of the boto.s3.Key object. Also, see this section of the S3 docs for more information on how restore works.
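With boto3, the equivalent restore for an object that a lifecycle rule has transitioned to Glacier looks roughly like this (bucket and key names are placeholders; the boto.s3.Key.restore() call mentioned above is the boto 2 equivalent):
import boto3

s3 = boto3.client('s3')

# Ask S3 to restore a temporary copy from the Glacier storage class
s3.restore_object(
    Bucket='my-bucket',
    Key='archives/backup.tar',
    RestoreRequest={
        'Days': 7,  # how long the restored copy stays available in S3
        'GlacierJobParameters': {'Tier': 'Standard'},
    },
)

# The restore runs asynchronously; head_object reports its progress
head = s3.head_object(Bucket='my-bucket', Key='archives/backup.tar')
print(head.get('Restore'))  # e.g. 'ongoing-request="true"' while still in progress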