Clone an entire S3 bucket to another bucket - python

So I need to clone an entire bucket via the AWS SDK for Python (boto3). But my bucket has more than 5 million objects, so calling objects_response = client.list_objects_v2(Bucket=bucket_name) repeatedly and then performing a copy on each file takes too much time, and it is not robust: if the process fails, I have to start over with that amount of files. So how can I do this in a faster and more reliable way?

AWS CLI s3 sync
The AWS Command-Line Interface (CLI) aws s3 sync command can copy S3 objects in parallel.
You can adjust the settings to enable more simultaneous copies via the AWS CLI S3 Configuration file.
The sync command uses CopyObject() to copy objects, which tells S3 to copy the objects between buckets. Therefore, no data is downloaded/uploaded -- it just sends commands to S3 to manage the copy.
Running the sync command from an Amazon EC2 instance will reduce network latency, resulting in a faster copy (especially for many, smaller objects).
You could improve copy speed by running several copies of aws s3 sync (preferably from multiple computers). For example, each could be responsible for copying a separate sub-directory.
See also: Improve transfer performance of sync command in Amazon S3
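If you'd rather stay in boto3, a minimal sketch of the same server-side idea is below: paginate the listing and issue CopyObject requests from a thread pool. The bucket names and worker count are placeholders, and for millions of objects this will still be slower and less restartable than sync or Batch Operations.
import boto3
from concurrent.futures import ThreadPoolExecutor

SRC_BUCKET = 'source-bucket'       # placeholder
DST_BUCKET = 'destination-bucket'  # placeholder

s3 = boto3.client('s3')

def copy_one(key):
    # CopyObject runs entirely inside S3; no object data passes through this machine.
    s3.copy({'Bucket': SRC_BUCKET, 'Key': key}, DST_BUCKET, key)

paginator = s3.get_paginator('list_objects_v2')
with ThreadPoolExecutor(max_workers=32) as pool:
    for page in paginator.paginate(Bucket=SRC_BUCKET):
        for obj in page.get('Contents', []):
            pool.submit(copy_one, obj['Key'])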
Amazon S3 Batch Operations
Amazon S3 itself can also perform a Copy operation on a large number of files by using Amazon S3 Batch Operations:
First, create an Amazon S3 Inventory report that lists all the objects (or supply a CSV file with object names)
Then, create an S3 Batch Operations "Copy" job, pointing to the destination bucket
The entire process will be managed by Amazon S3.
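For reference, such a job can also be created from Python via the s3control API. Here's a rough sketch; the account ID, role ARN, manifest location and ETag are all placeholders you would substitute with your own values.
import boto3

s3control = boto3.client('s3control')

# Account ID, ARNs and the manifest ETag below are placeholders.
response = s3control.create_job(
    AccountId='111111111111',
    ConfirmationRequired=False,
    Priority=10,
    RoleArn='arn:aws:iam::111111111111:role/s3-batch-copy-role',
    Operation={'S3PutObjectCopy': {'TargetResource': 'arn:aws:s3:::destination-bucket'}},
    Manifest={
        'Spec': {'Format': 'S3BatchOperations_CSV_20180820', 'Fields': ['Bucket', 'Key']},
        'Location': {
            'ObjectArn': 'arn:aws:s3:::manifest-bucket/manifest.csv',
            'ETag': 'etag-of-the-manifest-object',
        },
    },
    Report={'Enabled': False},
)
print(response['JobId'])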

Related

How to continuously copy new S3 files to another S3 bucket

How can I continuously copy one S3 bucket to another? I want to copy the files every time a new file has been added.
I've tried using the boto3 copy_object, however it requires the object key each time, which won't work since I'm getting a new file each time.
From Replicating objects - Amazon Simple Storage Service:
To automatically replicate new objects as they are written to the bucket use live replication, such as Same-Region Replication (SRR) or Cross-Region Replication (CRR).
S3 Replication will automatically create new objects in another bucket as soon as they are created. (Well, it can take a few seconds.)
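For reference, replication can also be configured from boto3. Here is a rough sketch, assuming versioning is already enabled on both buckets; the role ARN and bucket names are placeholders.
import boto3

s3 = boto3.client('s3')

# Placeholders; both buckets must already have versioning enabled.
s3.put_bucket_replication(
    Bucket='source-bucket',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::111111111111:role/s3-replication-role',
        'Rules': [{
            'ID': 'replicate-all-new-objects',
            'Priority': 1,
            'Filter': {'Prefix': ''},  # empty prefix = every object
            'Status': 'Enabled',
            'DeleteMarkerReplication': {'Status': 'Disabled'},
            'Destination': {'Bucket': 'arn:aws:s3:::destination-bucket'},
        }],
    },
)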
Alternatively, you could configure the S3 bucket to trigger an AWS Lambda function that uses the CopyObject() command to copy the object to another location. This method is useful if you want to selectively copy files, by having the Lambda function perform some logic before performing the copy (such as checking the file type).
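A minimal sketch of that Lambda approach (the destination bucket name is a placeholder, and the function's role needs read access to the source bucket and write access to the destination):
import urllib.parse
import boto3

s3 = boto3.client('s3')
DEST_BUCKET = 'destination-bucket'  # placeholder

def lambda_handler(event, context):
    # One invocation can carry several S3 event records.
    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in the event payload.
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # Optional filtering logic goes here (e.g. only copy certain file types).
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=key,
            CopySource={'Bucket': src_bucket, 'Key': key},
        )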
Please look at this: https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/
You can use the aws cli s3 sync command to achieve this.

copy file from AWS S3 (as static website) to AWS S3 [duplicate]

We need to move our video file storage to AWS S3. The old location is a CDN, so I only have a URL for each file (1000+ files, > 1 TB total file size). Running an upload tool directly on the storage server is not an option.
I already created a tool that downloads the file, uploads it to an S3 bucket and updates the DB records with the new HTTP URL, and it works perfectly, except that it takes forever.
Downloading the file takes some time (considering each file is close to a gigabyte) and uploading it takes longer.
Is it possible to upload the video file directly from the CDN to S3, so I could cut the processing time roughly in half? Something like reading a chunk of the file and then putting it to S3 while reading the next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: I have no problem with internet speed; I run the app on a server with a 1 Gbit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream"; you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this operation with threads or async I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
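The asker is on .NET, but to illustrate the range-GET-plus-multipart idea, here is a rough Python sketch (URL, bucket and key are placeholders; retries and error handling are omitted, and the origin must support Range requests):
import boto3
import requests
from concurrent.futures import ThreadPoolExecutor

SRC_URL = 'http://cdn.example.com/video.mp4'     # placeholder
BUCKET, KEY = 'destination-bucket', 'video.mp4'  # placeholders
PART_SIZE = 8 * 1024 * 1024                      # 8 MB; parts must be >= 5 MB

s3 = boto3.client('s3')
size = int(requests.head(SRC_URL).headers['Content-Length'])
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

def copy_part(part_number, start):
    end = min(start + PART_SIZE, size) - 1
    # Range GET from the origin, then upload that byte range as one part.
    data = requests.get(SRC_URL, headers={'Range': f'bytes={start}-{end}'}).content
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload['UploadId'],
                          PartNumber=part_number, Body=data)
    return {'PartNumber': part_number, 'ETag': resp['ETag']}

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(copy_part, i + 1, start)
               for i, start in enumerate(range(0, size, PART_SIZE))]
    parts = sorted((f.result() for f in futures), key=lambda p: p['PartNumber'])

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload['UploadId'],
                             MultipartUpload={'Parts': parts})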
I answered this in this question; here's the gist:
require 'aws-sdk-s3'
require 'open-uri'

object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
  IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is not a 'direct' pull by S3, though. But at least it doesn't download each file and then upload it in serial; it streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
If a proxy (Node/Express) is suitable for you, then the portions of code at these 2 routes could be combined to do a GET/POST fetch chain, retrieving and then re-posting the response body to your destination S3 bucket.
Step one creates response.body.
Step two: set the stream in the 2nd link to the response from the GET operation in link 1, and you will upload to the destination bucket the stream (arrayBuffer) from the first fetch.

download file from s3 to local automatically

I am creating a Glue job (Python shell) to export data from Redshift and store it in S3. But how would I automate/trigger downloading the file from S3 to a local network drive so the 3rd-party vendor can pick it up?
Without using Glue, I can create a Python utility that runs on a local server to extract data from Redshift as a file and save it to the local network drive, but I wanted to implement this framework in the cloud to avoid a dependency on the local server.
The AWS CLI sync command won't help, because once the vendor picks up the file I should not put it in the local folder again.
Please suggest good alternatives.
If the interface team can use S3 API or CLI to get objects from S3 to put on the SFTP server, granting them S3 access through an IAM user or role would probably be the simplest solution. The interface team could write a script that periodically gets the list of S3 objects created after a specified date and copies them to the SFTP server.
If they can't use S3 API or CLI, you could use signed URLs. You'd still need to communicate the S3 object URLs to the interface team. A queue would be a good solution for that. But if they can use an AWS SQS client, I think it's likely they could just use the S3 API to find new objects and retrieve them.
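A short sketch of the signed-URL-plus-queue idea (bucket, key and queue URL are placeholders):
import boto3

s3 = boto3.client('s3')
sqs = boto3.client('sqs')

# Placeholders for your export bucket, object key and queue.
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'export-bucket', 'Key': 'exports/report.csv'},
    ExpiresIn=24 * 3600,  # link stays valid for one day
)
sqs.send_message(
    QueueUrl='https://sqs.us-east-1.amazonaws.com/111111111111/export-notifications',
    MessageBody=url,
)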
It's not clear to me who controls the SFTP server, whether it's your interface team or the 3rd-party vendor. If you can push files to the SFTP server yourself, you could create an S3 event notification that runs a Lambda function to copy the object to the SFTP server every time a new object is created in the S3 bucket.
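If you do control the push, a rough sketch of such a Lambda is below; it assumes the paramiko library is bundled with the function, and the SFTP host, credentials and paths are placeholders.
import boto3
import paramiko  # third-party dependency, packaged with the Lambda

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body']

        # Placeholder SFTP host and credentials.
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect('sftp.example.com', username='vendor', key_filename='/tmp/id_rsa')
        sftp = ssh.open_sftp()
        # Stream the S3 object straight onto the SFTP server.
        sftp.putfo(body, '/incoming/' + key.split('/')[-1])
        sftp.close()
        ssh.close()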

Split S3 file into smaller files of 1000 lines

I have a text file on S3 with around 300 million lines. I'm looking to split this file into smaller files of 1,000 lines each (with the last file containing the remainder), which I'd then like to put into another folder or bucket on S3.
So far, I've been running this on my local drive using the linux command:
split -l 1000 file
which splits the original file into smaller files of 1,000 lines. However, with a larger file like this, it seems inefficient to download and then re-upload from my local drive back up to S3.
What would be the most efficient way to split this S3 file, ideally using Python (in a Lambda function) or using other S3 commands? Is it faster to just run this on my local drive?
Anything that you do will have to download the file, split it, and re-upload it. The only question is where, and whether local disk is involved.
John Rotenstein gave you an example using local disk on an EC2 instance. This has the benefit of running in the AWS datacenters, so it gets a high-speed connection, but has the limitations that (1) you need disk space to store the original file and its pieces, and (2) you need an EC2 instance where you can do this.
One small optimization is to avoid the local copy of the big file, by using a hyphen as the destination of the s3 cp: this will send the output to standard out, and you can then pipe it into split (here I'm also using a hyphen to tell split to read from standard input):
aws s3 cp s3://my-bucket/big-file.txt - | split -l 1000 - output.
aws s3 cp . s3://dest-bucket/ --recursive --exclude "*" --include "output.*"
Again, this requires an EC2 instance to run it on, and the storage space for the output files. There is, however, a flag to split that will let you run a shell command for each file in the split:
aws s3 cp s3://src-bucket/src-file - | split -l 1000 --filter 'aws s3 cp - s3://dst-bucket/result.$FILE' -
So now you've eliminated the issue of local storage, but are left with the issue of where to run it. My recommendation would be AWS Batch, which can spin up an EC2 instance for just the time needed to perform the command.
You can, of course, write a Python script to do this on Lambda, and that would have the benefit of being triggered automatically when the source file has been uploaded to S3. I'm not that familiar with the Python SDK (boto), but it appears that get_object will return the original file's body as a stream of bytes, which you can then iterate over as lines, accumulating however many lines you want into each output file.
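A rough sketch of that Lambda/boto3 idea, streaming the object line by line (bucket names and prefix are placeholders; note that for 300 million lines a single run may well exceed Lambda's 15-minute limit, which is another argument for the Batch/EC2 approaches above):
import boto3

s3 = boto3.client('s3')

SRC_BUCKET, SRC_KEY = 'src-bucket', 'big-file.txt'  # placeholders
DST_BUCKET, DST_PREFIX = 'dst-bucket', 'split/'     # placeholders
LINES_PER_FILE = 1000

body = s3.get_object(Bucket=SRC_BUCKET, Key=SRC_KEY)['Body']

chunk, part = [], 0
for line in body.iter_lines():  # streams the object without loading it all into memory
    chunk.append(line)
    if len(chunk) == LINES_PER_FILE:
        s3.put_object(Bucket=DST_BUCKET, Key=f'{DST_PREFIX}part-{part:06d}.txt',
                      Body=b'\n'.join(chunk) + b'\n')
        chunk, part = [], part + 1
if chunk:  # the remainder, if the line count isn't a multiple of 1000
    s3.put_object(Bucket=DST_BUCKET, Key=f'{DST_PREFIX}part-{part:06d}.txt',
                  Body=b'\n'.join(chunk) + b'\n')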
Your method seems sound (download, split, upload).
You should run the commands from an Amazon EC2 instance in the same region as the Amazon S3 bucket.
Use the AWS Command-Line Interface (CLI) to download/upload the files:
aws s3 cp s3://my-bucket/big-file.txt .
aws s3 cp --recursive folder-with-files s3://my-bucket/destination-folder/

parallel copy of buckets/keys from boto3 or boto api between 2 different accounts/connections

I want to copy keys from buckets between 2 different accounts using boto3 api's.
In boto3, I executed the following code and the copy worked
import boto3

source = boto3.client('s3')
destination = boto3.client('s3')
response = source.get_object(Bucket='source-bucket', Key='key')
destination.put_object(Bucket='destination-bucket', Key='key', Body=response['Body'].read())
Basically I am fetching data with GET and writing it with PUT into another account.
Along similar lines with the boto API, I have done the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()
source_bucket = source.get_bucket('source-bucket')
source_key = Key(source_bucket, key_name)

destination = S3Connection()
destination_bucket = destination.get_bucket('destination-bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really very slow: it takes around 15-20 seconds to copy 1 GB of data, and I have to copy more than 100 GB.
I tried Python multithreading, where each thread performs a copy operation. The performance was worse, taking 30 seconds to copy 1 GB. I suspect the GIL might be the issue here.
I tried multiprocessing and I get the same result as a single process, i.e. 15-20 seconds for a 1 GB file.
I am using a very high-end server with 48 cores and 128 GB RAM. The network speed in my environment is 10 Gbps.
Most search results describe copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend calls to do the work. Your approach is wasting resources.
boto3.client.copy will do the job better than this.
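For example, a sketch of the managed server-side copy between two sessions (profile names, buckets and key are placeholders; the credentials performing the copy need read access on the source bucket and write access on the destination):
import boto3

# Separate sessions for the two accounts (profile names are placeholders).
source_client = boto3.Session(profile_name='source-account').client('s3')
dest_client = boto3.Session(profile_name='destination-account').client('s3')

# The managed copy uses CopyObject / UploadPartCopy under the hood, so the
# object data does not travel through this machine.
dest_client.copy(
    CopySource={'Bucket': 'source-bucket', 'Key': 'some/key'},
    Bucket='destination-bucket',
    Key='some/key',
    SourceClient=source_client,
)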
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copy without a server.
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store shared by everyone, which is why bucket names MUST be unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replicate, copy, or move files in the backend, without involving network traffic on your side.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is similar to a file server. Users can copy files to each other without any "download/upload"; one just creates a folder with write permission for all, and a file copy from another user is done entirely within the file server, with the fastest raw disk I/O performance. You don't need a powerful instance or a high-performance network when you use the backend S3 copy API.
Your method is like trying to FTP-download a file from a user on the same file server, which creates unwanted network traffic.
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")
