Uploading large number of files to S3 with boto3 [duplicate] - python

Hey, there were some similar questions, but none exactly like this, and a fair number of them were multiple years old and out of date.
I have written some code on my server that uploads jpeg photos into an s3 bucket using a key via the boto3 method upload_file. Initially this seemed great. It is a super simple solution to uploading files into s3.
The thing is, I have users. My users are sending their jpegs to my server via a phone app. While I concede that I could generate presigned upload URLs and send them to the phone app, that would require a considerable rewrite of our phone app and API.
So I just want the phone app to send the photos to the server. I then want to send the photos from the server to s3. I implemented this but it is way too slow. I cannot ask my users to tolerate those slow uploads.
What can I do to speed this up?
I did some Google searching and found this: https://medium.com/@alejandro.millan.frias/optimizing-transfer-throughput-of-small-files-to-amazon-s3-or-anywhere-really-301dca4472a5
It suggests that the solution is to increase the number of TCP/IP connections. More TCP/IP connections means faster uploads.
Okay, great!
How do I do that? How do I increase the number of TCP/IP connections so I can upload a single jpeg into AWS s3 faster?
Please help.

Ironically, we've been using boto3 for years, as well as awscli, and we like them both.
But we've often wondered why awscli's aws s3 cp --recursive, or aws s3 sync, are often so much faster than trying to do a bunch of uploads via boto3, even with concurrent.futures' ThreadPoolExecutor or ProcessPoolExecutor (and don't you even dare share the same s3.Bucket among your workers: it's warned against in the docs, and for good reason; nasty crashes will eventually ensue at the most inconvenient time).
Finally, I bit the bullet and looked inside the "customization" code that awscli introduces on top of boto3.
Based on that little exploration, here is a way to speed up the upload of many files to S3 by using the concurrency already built into boto3.s3.transfer, not just for the possible multiparts of a single, large file, but for a whole bunch of files of various sizes as well. That functionality is, as far as I know, not exposed through the higher-level APIs of boto3 that are described in the boto3 docs.
The following:
Uses boto3.s3.transfer to create a TransferManager, the very same one that is used by awscli's aws s3 sync, for example.
Extends the max number of threads to 20.
Augments the underlying urllib3 max pool connections capacity used by botocore to match (by default, it uses 10 connections maximum).
Gives you an optional callback capability (demoed here with a tqdm progress bar, but of course you can have whatever callback you'd like).
Is fast (over 100 MB/s, tested on an EC2 instance).
I put a complete example as a gist here that includes the generation of 500 random csv files for a total of about 360MB. Here below, we assume you already have a bunch of files in filelist, for a total of totalsize bytes:
import os
import boto3
import botocore
import boto3.s3.transfer as s3transfer

def fast_upload(session, bucketname, s3dir, filelist, progress_func, workers=20):
    botocore_config = botocore.config.Config(max_pool_connections=workers)
    s3client = session.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(
        use_threads=True,
        max_concurrency=workers,
    )
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    for src in filelist:
        dst = os.path.join(s3dir, os.path.basename(src))
        s3t.upload(
            src, bucketname, dst,
            subscribers=[
                s3transfer.ProgressCallbackInvoker(progress_func),
            ],
        )
    s3t.shutdown()  # wait for all the upload tasks to finish
Example usage
from tqdm import tqdm

bucketname = '<your-bucket-name>'
s3dir = 'some/path/for/junk'
filelist = [...]
totalsize = sum([os.stat(f).st_size for f in filelist])

with tqdm(desc='upload', ncols=60,
          total=totalsize, unit='B', unit_scale=1) as pbar:
    fast_upload(boto3.Session(), bucketname, s3dir, filelist, pbar.update)

Related

copy file from AWS S3 (as static website) to AWS S3 [duplicate]

We need to move our video file storage to AWS S3. The old location is a CDN, so I only have a URL for each file (1000+ files, >1TB total file size). Running an upload tool directly on the storage server is not an option.
I already created a tool that downloads each file, uploads it to the S3 bucket, and updates the DB records with the new HTTP URL, and it works perfectly, except that it takes forever.
Downloading a file takes some time (considering each one is close to a gigabyte) and uploading it takes longer.
Is it possible to upload the video file directly from the CDN to S3, so I could cut the processing time in half? Something like reading a chunk of the file and putting it to S3 while reading the next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: I have no problem with internet speed, I run the app on a server with 1GBit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream": you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this with threads or async I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
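For illustration, here is a minimal Python sketch of that range-GET plus multipart-upload idea (the question uses .NET, but the rest of this page is Python); the URL, bucket, key, part size and worker count are placeholders, and retries/error handling are omitted:
import math
from concurrent.futures import ThreadPoolExecutor

import boto3
import requests

SRC_URL = 'https://cdn.example.com/video.mp4'   # hypothetical source URL
BUCKET, KEY = 'my-bucket', 'videos/video.mp4'   # hypothetical destination
PART_SIZE = 64 * 1024 * 1024                    # 64MB parts (S3 minimum is 5MB, except for the last part)

s3 = boto3.client('s3')
size = int(requests.head(SRC_URL, allow_redirects=True).headers['Content-Length'])
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

def copy_part(part_no):
    # Fetch one byte range from the origin and push it as one multipart part
    start = (part_no - 1) * PART_SIZE
    end = min(start + PART_SIZE, size) - 1
    data = requests.get(SRC_URL, headers={'Range': 'bytes=%d-%d' % (start, end)}).content
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=part_no,
                          UploadId=mpu['UploadId'], Body=data)
    return {'PartNumber': part_no, 'ETag': resp['ETag']}

with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(copy_part, range(1, math.ceil(size / PART_SIZE) + 1)))

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})
If a range fails, you only need to retry that one part before calling complete_multipart_upload.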
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
I answered this in this question; here's the gist:
object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
  IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is not a 'direct' pull into S3, though. At least it doesn't download each file and then upload it serially; it streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
If a proxy (Node/Express) is suitable for you, then the portions of code at these 2 routes could be combined to do a GET/POST fetch chain, retrieving and then re-posting the response body to your destination S3 bucket.
Step one creates response.body.
Step two: set the stream in the 2nd link to the response from the GET op in link 1, and you will upload the stream (arrayBuffer) from the first fetch to the destination bucket.

Boto3 put_object() is very slow

TL;DR: Trying to put .json files into S3 bucket using Boto3, process is very slow. Looking for ways to speed it up.
This is my first question on SO, so I apologize if I leave out any important details. Essentially I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I scroll, I am processing the data and putting it in the bucket as [timestamp].json, using this:
s3 = boto3.resource('s3')
data = '{"some":"json","test":"data"}'
key = "path/to/my/file/[timestamp].json"
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
While running this on my machine, I noticed that this process is very slow. Using line profiler, I discovered that this line is consuming over 96% of the time in my entire program:
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
What modification(s) can I make in order to speed up this process? Keep in mind, I am creating the .json files in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading the files. Thanks in advance.
Since you are potentially uploading many small files, you should consider a few items:
Some form of threading/multiprocessing. For example, see How to upload small files to Amazon S3 efficiently in Python
Creating some form of archive file (ZIP) containing sets of your small data blocks and uploading them as larger files. This is of course dependent on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods instead, as they will handle multipart upload and threading (a rough sketch follows this list).
S3 performance implications as described in Request Rate and Performance Considerations
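As a rough illustration of the archive option, here is a minimal sketch (bucket name, keys and payloads are hypothetical) that bundles a batch of small JSON documents into an in-memory ZIP and hands it to upload_fileobj, which uses boto3's managed transfer:
import io
import zipfile

import boto3

s3 = boto3.client('s3')

# Hypothetical batch of small JSON payloads keyed by timestamp
batch = {
    '2020-01-01T00:00:00.json': '{"some":"json","test":"data"}',
    '2020-01-01T00:00:01.json': '{"more":"json"}',
}

# Write the batch into a single in-memory ZIP archive
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    for name, payload in batch.items():
        zf.writestr(name, payload)
buf.seek(0)

# One upload instead of hundreds of tiny put_object calls
s3.upload_fileobj(buf, 'my_bucket', 'path/to/my/file/batch-2020-01-01.zip')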

parallel copy of buckets/keys from boto3 or boto api between 2 different accounts/connections

I want to copy keys from buckets between 2 different accounts using boto3 api's.
In boto3, I executed the following code and the copy worked
source = boto3.client('s3')
destination = boto3.client('s3')
destination.put_object(source.get_object(Bucket='bucket', Key='key'))
Basically I am fetching data from GET and pasting that with PUT in another account.
Along similar lines in the boto API, I have done the following
from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)
destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really very slow. It takes around 15-20 seconds to copy 1GB of data, and I have to copy 100GB plus.
I tried Python multithreading, wherein each thread does the copy operation. The performance was bad, as it took 30 seconds to copy 1GB. I suspect the GIL might be the issue here.
I did multiprocessing and I am getting the same result as with a single process, i.e. 15-20 seconds for a 1GB file.
I am using a very high-end server with 48 cores and 128GB RAM. The network speed in my environment is 10 Gbps.
Most of the search results tell about copying data between buckets in same account and not across accounts. Can anyone please guide me here. Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend calls to do the work. Your approach is wasting resources.
boto3.client.copy will do the job better than this.
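As a minimal sketch of what that looks like (bucket names, keys, and profile names are placeholders; for a cross-account copy, the credentials used by the destination client also need read access to the source bucket, e.g. via a bucket policy):
import boto3

src_session = boto3.Session(profile_name='source-account')        # hypothetical profile
dst_session = boto3.Session(profile_name='destination-account')   # hypothetical profile

dst_client = dst_session.client('s3')

# Managed, server-side copy: the object data never flows through this machine,
# and large objects are copied with multipart "copy part" requests.
dst_client.copy(
    CopySource={'Bucket': 'source-bucket', 'Key': 'some/key'},
    Bucket='destination-bucket',
    Key='some/key',
    SourceClient=src_session.client('s3'),  # client used for operations on the source object (e.g. HeadObject)
)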
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
Because with a proper understanding of your own needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copying job without a server.
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store for everyone; that's why the bucket name MUST be unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replication, copy, and move files in the backend, without involving network traffic.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is almost like a file server. Users can copy files to each other without "download/upload"; instead, one just creates a folder with write permission for all, and copying a file from another user is all done within the file server, with the fastest raw disk I/O performance. You don't need a powerful instance or a high-performance network to use the backend S3 copy API.
Your method is similar to attempting to FTP-download a file from another user of the same file server, which creates unwanted network traffic.
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")

aws boto3 s3 put_object error handling/testing

How should errors be handled/tested for python AWS boto3 s3 put_object? For example:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('foo')
bucket.put_object(Key='bar', Body='foobar')
Are the errors that can arise documented somewhere? Is the following even the right documentation page (it seems to be for boto3.client('s3') client, not boto3.resource('s3')), and if so where are the errors documented?
http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.put_object
Simple errors like a non-existent bucket seem easy enough to test, but can spurious errors occur, and if so, how can that kind of error handling be tested? Are there limits to the upload rate? I tried the following and was surprised to see all 10000 files successfully created after about 2 minutes of running. Does S3 block, as opposed to error, when some rate is exceeded?
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('foo')

def put(i):
    bucket.put_object(Key='bar/%d' % i, Body='foobar')

executor = ThreadPoolExecutor(max_workers=1024)
for i in range(10000):
    executor.submit(put, i)
Is it good practice to retry the put_object call 1 or more times if some error occurs?
AWS S3 does not restrict uploads based on the number of requests. The restrictions are on size:
For example:
A single PUT (or browser-based POST) request can upload an object of up to 5GB.
A multipart upload can create an object of up to 5TB.
The errors you are trying or expecting to handle are nothing but client/browser restrictions while uploading multiple files at a time.
Boto3's upload interface does have a parameter called Config, in which you can limit concurrent uploads:
from boto3.s3.transfer import TransferConfig

# To consume less downstream bandwidth, decrease the maximum concurrency
config = TransferConfig(max_concurrency=5)
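A minimal sketch of passing such a config to the managed upload (file, bucket, and key names are placeholders):
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(max_concurrency=5)
s3 = boto3.client('s3')

# The Config argument hands the transfer settings to the managed upload
s3.upload_file('tmp.txt', 'bucket-name', 'key-name', Config=config)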

How to efficiently copy all files from one directory to another in an amazon S3 bucket with boto?

I need to copy all keys from '/old/dir/' to '/new/dir/' in an amazon S3 bucket.
I came up with this script (quick hack):
import boto

s3 = boto.connect_s3()
thebucket = s3.get_bucket("bucketname")
keys = thebucket.list('/old/dir')

for k in keys:
    newkeyname = '/new/dir' + k.name.partition('/old/dir')[2]
    print 'new key name:', newkeyname
    thebucket.copy_key(newkeyname, k.bucket.name, k.name)
For now it is working, but it is much slower than what I can do manually in the graphical management console by just copy/pasting with the mouse. Very frustrating, and there are lots of keys to copy...
Do you know any quicker method ? Thanks.
Edit: maybe I can do this with concurrent copy processes. I'm not really familiar with boto's copy-key methods and how many concurrent requests I can send to Amazon.
Edit 2: I'm currently learning Python multiprocessing. Let's see if I can send 50 copy operations simultaneously...
Edit 3: I tried 30 concurrent copies using the Python multiprocessing module. Copying was much faster than within the console and less error-prone. There is a new issue with large files (>5GB): boto raises an exception. I need to debug this before posting the updated script.
Regarding your issue with files over 5GB: S3 doesn't support copying or uploading objects over 5GB in a single PUT request, which is what boto tries to do here (see the boto source, Amazon S3 documentation).
Unfortunately I'm not sure how you can get around this, apart from downloading the object and re-uploading it as a multipart upload. I don't think boto supports a multipart copy operation yet (if such a thing even exists).
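For what it's worth, present-day boto3 exposes a managed copy that switches to multipart "copy part" requests for large objects, so objects over 5GB can still be copied server-side. A minimal sketch, assuming boto3 and the placeholder bucket/key names from the question:
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Managed copy: the data stays inside S3 even for objects larger than 5GB
s3.copy(
    CopySource={'Bucket': 'bucketname', 'Key': '/old/dir/bigfile.bin'},
    Bucket='bucketname',
    Key='/new/dir/bigfile.bin',
    Config=TransferConfig(max_concurrency=10),
)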
