I am having a problem where large file uploads (over 1GB) to S3 get stuck when using boto's set_contents_from_filename.
I have tried using set_contents_from_file instead, and I get the same behaviour.
I am using the cb argument on both functions to call a callback while uploading, which tells me how the upload is progressing. I see that a 1GB file gets stuck somewhere around 800MB.
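For context, the call looks roughly like this (bucket, key, and file names are placeholders, not the actual values):

import boto
from boto.s3.key import Key

def progress(transmitted, total):
    # Called periodically during the upload; num_cb controls how often.
    print('%d / %d bytes' % (transmitted, total))

conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')
key = Key(bucket, 'big-file.bin')
key.set_contents_from_filename('/path/to/big-file.bin', cb=progress, num_cb=100)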
EDIT: It seems that this function has a memory leak, as described here:
boto set_contents_from_filename memory leak
It seems you are trying to use boto, which is slowly becoming obsolete.
In the long term, switching to boto3 is inevitable, as the older boto is no longer really maintained. See boto3 is not a replacement of boto (yet?)
You may find an example of uploading files here:
https://stackoverflow.com/a/29636604/346478
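For completeness, a minimal boto3 upload looks roughly like this (bucket and file names are placeholders); upload_file switches to a multipart upload for large files automatically:

import boto3

s3 = boto3.client('s3')

def progress(bytes_transferred):
    # boto3's Callback reports incremental byte counts rather than (sent, total).
    print('transferred %d more bytes' % bytes_transferred)

s3.upload_file('/path/to/big-file.bin', 'my-bucket', 'big-file.bin', Callback=progress)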
Related
I am trying to upload a 400 GB .ibd file from a MySQL DB machine into D42/S3.
I am using the set_contents_from_file function of Python boto, but it is taking a lot of time and I cannot see the progress (how much has been uploaded / how much is left).
Does anyone have a Python script that uses threads or parallel multipart upload? It's a very simple use case for an end user, but boto's documentation doesn't have any function like this.
In the end I did it with s3cmd and not with Python.
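For anyone who wants to stay in Python, boto does expose a multipart-upload API directly; a rough sketch, assuming a 100 MB part size and placeholder bucket, key, and file names, might look like this (each part could also be uploaded from a separate thread or process to parallelise the transfer):

import os
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')
filename = '/path/to/mysql.ibd'
part_size = 100 * 1024 * 1024            # 100 MB per part (parts must be at least 5 MB)
total = os.path.getsize(filename)

mp = bucket.initiate_multipart_upload('backups/mysql.ibd')
try:
    with open(filename, 'rb') as fp:
        part_num = 0
        while fp.tell() < total:
            part_num += 1
            # upload_part_from_file reads `size` bytes from the current file position
            mp.upload_part_from_file(fp, part_num, size=min(part_size, total - fp.tell()))
    mp.complete_upload()
except Exception:
    mp.cancel_upload()
    raise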
TL;DR: Trying to put .json files into S3 bucket using Boto3, process is very slow. Looking for ways to speed it up.
This is my first question on SO, so I apologize if I leave out any important details. Essentially I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I scroll, I process the data and put it in the bucket as [timestamp].json, using this:
import boto3

s3 = boto3.resource('s3')
data = '{"some":"json","test":"data"}'
key = "path/to/my/file/[timestamp].json"
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
While running this on my machine, I noticed that this process is very slow. Using line profiler, I discovered that this line is consuming over 96% of the time in my entire program:
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
What modification(s) can I make in order to speed up this process? Keep in mind, I am creating the .json files in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading the files. Thanks in advance.
Since you are potentially uploading many small files, you should consider a few items:
Some form of threading/multiprocessing. For example, see How to upload small files to Amazon S3 efficiently in Python, and the threaded sketch after this list.
Creating some form of archive file (ZIP) containing sets of your small data blocks and uploading them as larger files. This is of course dependent on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods instead as they will handle multi-part upload and threading.
S3 performance implications as described in Request Rate and Performance Considerations
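Here is a minimal threaded sketch of the first suggestion, assuming a boto3 client shared across worker threads (bucket name, key prefix, and the sample data are placeholders):

import json
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')

def put_doc(item):
    timestamp, doc = item
    key = "path/to/my/file/%s.json" % timestamp
    s3.put_object(Bucket='my_bucket', Key=key, Body=json.dumps(doc))
    return key

# Placeholder data; in practice this would be the documents pulled from Elasticsearch.
docs = [("2017-01-01T00:00:00", {"some": "json", "test": "data"})]

with ThreadPoolExecutor(max_workers=20) as pool:
    for key in pool.map(put_doc, docs):
        print(key)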
I've got a python script that Works On My Machine (OSX, python 2.7.13, boto3 1.4.4) but won't work for my colleague (Windows7, otherwise same).
The authentication seems to work, and we can both call S3's list_objects_v2 and get_object. However, when he tries to upload with put_object, it times out. Here is a full log; the upload starts at line 45.
I've tried using his credentials and it works. He's tried uploading a tiny file, and it works when the file is only a few bytes, but even a few kilobytes is too big. We've even tried it on another Windows machine on another internet connection, with no luck.
My upload code is pretty simple:
with open("tmp_build.zip", "r") as zip_to_upload:
upload_response = s3.put_object(Bucket=target_bucket, Body=zip_to_upload, Key=build_type+".zip")
The Key resolves to test.zip in our runs, and the file is about 15 MB.
Why is it failing on windows? What more debug info can I give you?
Taking inspiration from this issue, https://github.com/boto/boto3/issues/870, I added .read() to my Body parameter, and lo, it works.
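The working call ends up looking roughly like this (the only change described is adding .read(); opening the file in binary mode is an additional precaution that is generally advisable on Windows):

with open("tmp_build.zip", "rb") as zip_to_upload:
    # Passing the bytes instead of the file object avoided the hang on Windows.
    upload_response = s3.put_object(Bucket=target_bucket,
                                    Body=zip_to_upload.read(),
                                    Key=build_type + ".zip")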
Might be network issues. Are you on the same network?
Are you able to upload it using the AWS CLI?
Try the following:
aws s3 cp my-file.txt s3://my-s3-bucket/data/ --debug
I would also consider adding X retries to the upload; it might give you more information on the error at hand. Most of the time these are sporadic network-related issues.
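As a sketch of the retry suggestion, boto3 lets you raise the client's retry count via botocore's Config (the retry count here is arbitrary):

import boto3
from botocore.config import Config

# Arbitrary retry count; transient network failures will be retried instead of
# surfacing as a single long timeout.
s3 = boto3.client('s3', config=Config(retries={'max_attempts': 10}))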
I want to copy keys between buckets in 2 different accounts using the boto3 API.
In boto3, I executed the following code and the copy worked:
source = boto3.client('s3')
destination = boto3.client('s3')
# read the object from the source account and write it into the destination account
obj = source.get_object(Bucket='bucket', Key='key')
destination.put_object(Bucket='bucket', Key='key', Body=obj['Body'].read())
Basically I am fetching the data with GET and writing it with PUT into the other account.
Along similar lines with the boto API, I have done the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)

destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really very slow. It takes around 15-20 seconds to copy 1GB of data, and I have to copy more than 100GB.
I tried Python multithreading, where each thread performs a copy operation. The performance was bad, as it took 30 seconds to copy 1GB. I suspect the GIL might be the issue here.
I tried multiprocessing and I get the same result as a single process, i.e. 15-20 seconds for a 1GB file.
I am using a very high-end server with 48 cores and 128GB of RAM. The network speed in my environment is 10 Gbps.
Most of the search results are about copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend calls to do the work. Your approach is wasting resources.
boto3.client.copy will do the job better than this.
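For example, a minimal sketch of the managed copy (bucket and key names are placeholders; the client's credentials must have read access to the source object):

import boto3

s3 = boto3.client('s3')
copy_source = {'Bucket': 'source-bucket', 'Key': 'source-key'}
# copy() performs the transfer inside S3 (switching to multipart copy for large
# objects), so the data never passes through your server.
s3.copy(copy_source, 'destination-bucket', 'destination-key')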
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copy without a server.
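As a hypothetical illustration of the serverless route, an S3 "object created" event could invoke a Lambda function along these lines (the destination bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # Copy every newly created object into the destination bucket.
    # Note: keys in event records are URL-encoded; decode them if they may
    # contain spaces or special characters.
    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        s3.copy({'Bucket': src_bucket, 'Key': key}, 'destination-bucket', key)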
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store for everyone; that's why bucket names MUST be unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replicate, copy, and move files in the backend, without involving network traffic.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is much like a file server. Users can copy files to each other without "download/upload"; instead, one just creates a folder with write permission for all, and copying a file from another user is done entirely within the file server, at raw disk I/O speed. You don't need a powerful instance or a high-performance network when using the backend S3 copy API.
Your method is like attempting an FTP download of a file from a user on the same file server, which creates unwanted network traffic.
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")
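If you need to tune the behaviour, upload_file also accepts a TransferConfig; a sketch with arbitrary tuning values:

from boto3.s3.transfer import TransferConfig

# Arbitrary values: switch to multipart above 64 MB and use 10 upload threads.
config = TransferConfig(multipart_threshold=64 * 1024 * 1024, max_concurrency=10)
s3.upload_file("tmp.txt", "bucket-name", "key-name", Config=config)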
I need to copy all keys from '/old/dir/' to '/new/dir/' in an amazon S3 bucket.
I came up with this script (quick hack):
import boto
s3 = boto.connect_s3()
thebucket = s3.get_bucket("bucketname")
keys = thebucket.list('/old/dir')
for k in keys:
    newkeyname = '/new/dir' + k.name.partition('/old/dir')[2]
    print 'new key name:', newkeyname
    thebucket.copy_key(newkeyname, k.bucket.name, k.name)
For now it is working, but it is much slower than what I can do manually in the graphical management console by just copy/pasting with the mouse. Very frustrating, and there are lots of keys to copy...
Do you know of any quicker method? Thanks.
Edit: maybe I can do this with concurrent copy processes. I'm not really familiar with boto's key-copying methods or with how many concurrent requests I can send to Amazon.
Edit 2: I'm currently learning Python multiprocessing. Let's see if I can send 50 copy operations simultaneously...
Edit 3: I tried with 30 concurrent copies using the Python multiprocessing module. The copy was much faster than within the console and less error prone. There is a new issue with large files (>5GB): boto raises an exception. I need to debug this before posting the updated script.
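In the meantime, here is a rough sketch of the concurrent-copy idea (this is not the author's final script; the bucket and prefix names follow the original snippet and the pool size matches Edit 3):

import boto
from multiprocessing import Pool

BUCKET = "bucketname"

def copy_one(names):
    src_name, dst_name = names
    conn = boto.connect_s3()                 # one connection per worker process
    conn.get_bucket(BUCKET).copy_key(dst_name, BUCKET, src_name)
    return dst_name

if __name__ == '__main__':
    conn = boto.connect_s3()
    keys = conn.get_bucket(BUCKET).list('/old/dir')
    jobs = [(k.name, '/new/dir' + k.name.partition('/old/dir')[2]) for k in keys]
    pool = Pool(30)                          # 30 concurrent copies
    for done in pool.imap_unordered(copy_one, jobs):
        print('copied: ' + done)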
Regarding your issue with files over 5GB: S3 doesn't support uploading or copying objects over 5GB in a single PUT operation, which is what boto does here (see the boto source and the Amazon S3 documentation).
Unfortunately I'm not sure how you can get around this, apart from downloading the object and re-uploading it as a multipart upload. I don't think boto supports a multipart copy operation yet (if such a thing even exists).