Distributing a large file across S3 objects in Python

I am looking at this tutorial. I would like to know whether there is any way to distribute a large file over several different objects. For example, say I have a 60 GB video file and four S3 buckets of 15 GB each. How can I split my file so that it fits across storage of those sizes? I would be happy if you could share any tutorial.

Check this AWS documentation out. I think it would be useful.
http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
The important part of the link is below:
Depending on the size of the data you are uploading, Amazon S3 offers the following options:
Upload objects in a single operation—With a single PUT operation, you can upload objects up to 5 GB in size.
For more information, see Uploading Objects in a Single Operation.
Upload objects in parts—Using the multipart upload API, you can upload large objects, up to 5 TB.
The multipart upload API is designed to improve the upload experience for larger objects. You can upload objects in parts. These object parts can be uploaded independently, in any order, and in parallel. You can use a multipart upload for objects from 5 MB to 5 TB in size. For more information, see Uploading Objects Using Multipart Upload API.
We recommend that you use multipart uploading in the following ways:
If you're uploading large objects over a stable high-bandwidth network, use multipart uploading to maximize the use of your available bandwidth by uploading object parts in parallel for multi-threaded performance.
If you're uploading over a spotty network, use multipart uploading to increase resiliency to network errors by avoiding upload restarts. When using multipart uploading, you need to retry uploading only parts that are interrupted during the upload. You don't need to restart uploading your object from the beginning.
For more information about multipart uploads, see Multipart Upload Overview.
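If you are doing this from Python with boto3, the high-level transfer functions switch to multipart automatically once a file crosses a size threshold; a minimal sketch (bucket, key, and file names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Files above multipart_threshold are uploaded via the multipart API,
# with up to max_concurrency parts in flight at once.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB per part
    max_concurrency=8,
)

s3.upload_file('video.mp4', 'my-bucket', 'videos/video.mp4', Config=config)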

S3 buckets don't have restrictions on size, so there is typically no reason to split a file across buckets.
If you really want to split the file across buckets (and I would not recommend doing this) you can write the first 25% of bytes to an object in bucket A, the next 25% of bytes to an object in bucket B, etc. But that's moderately complicated (you have to split the source file and upload just the relevant bytes) and then you have to deal with combining them later in order to retrieve the complete file.
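Purely for illustration (and, again, not recommended), a rough sketch of that split in Python with boto3: read consecutive byte ranges of the source file and stream each range to an object in a different bucket. Bucket names, file name, and part size are placeholders.

import io
import boto3

s3 = boto3.client('s3')
buckets = ['bucket-a', 'bucket-b', 'bucket-c', 'bucket-d']  # hypothetical buckets
part_size = 15 * 1024 ** 3                                  # 15 GB slice per bucket

class BoundedReader(io.RawIOBase):
    """File-like wrapper that exposes at most `limit` bytes of `f`."""
    def __init__(self, f, limit):
        self.f, self.remaining = f, limit
    def readable(self):
        return True
    def read(self, size=-1):
        if self.remaining <= 0:
            return b''
        if size < 0 or size > self.remaining:
            size = self.remaining
        data = self.f.read(size)
        self.remaining -= len(data)
        return data

with open('video.mp4', 'rb') as f:
    for i, bucket in enumerate(buckets):
        # upload_fileobj streams the slice and uses multipart upload under the hood,
        # so a 15 GB part never has to fit in memory or in a single PUT
        s3.upload_fileobj(BoundedReader(f, part_size), bucket, f'video.mp4.part{i}')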
Why do you want to split the file across buckets?

Related

copy file from AWS S3 (as static website) to AWS S3 [duplicate]

We need to move our video file storage to AWS S3. The old location is a CDN, so I only have a URL for each file (1,000+ files, > 1 TB total).
I already created a tool that downloads each file, uploads it to the S3 bucket, and updates the DB record with the new HTTP URL; it works perfectly, except that it takes forever.
Downloading a file takes some time (each one is close to a gigabyte) and uploading it takes even longer.
Is it possible to upload the video file directly from the CDN to S3, so I could cut the processing time in half? Something like reading a chunk of the file and putting it to S3 while reading the next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: Internet speed is not a problem; I run the app on a server with a 1 Gbit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream": you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this with threads or async I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
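The same flow in Python (the question is .NET, but the idea translates directly): discover the size, GET chunks with HTTP Range headers, and feed them to S3's multipart upload. URL, bucket, and key are placeholders, and parallelism and retries are omitted for brevity.

import boto3
import requests

SRC_URL = 'http://cdn.example.com/video.mp4'   # placeholder origin URL
BUCKET, KEY = 'my-bucket', 'videos/video.mp4'  # placeholder destination
PART_SIZE = 16 * 1024 * 1024                   # 16 MB parts (minimum is 5 MB)

s3 = boto3.client('s3')
size = int(requests.head(SRC_URL).headers['Content-Length'])
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

parts = []
for num, start in enumerate(range(0, size, PART_SIZE), start=1):
    end = min(start + PART_SIZE, size) - 1
    body = requests.get(SRC_URL, headers={'Range': f'bytes={start}-{end}'}).content
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload['UploadId'],
                          PartNumber=num, Body=body)
    parts.append({'PartNumber': num, 'ETag': resp['ETag']})

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload['UploadId'],
                             MultipartUpload={'Parts': parts})

Run each range/part pair in a thread pool and you get the parallel behaviour described above.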
I have answered this in this question; here's the gist:
require 'aws-sdk-s3'
require 'open-uri'

object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
  # stream the download straight into the S3 upload, chunk by chunk
  IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is not a 'direct' pull into S3 by S3 itself, though. But at least it doesn't download each file and then upload it in a second, serial step; it streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
If a proxy (Node/Express) is suitable for you, the portions of code at these two routes could be combined into a GET/POST fetch chain, retrieving the file and then re-posting the response body to your destination S3 bucket.
Step one produces response.body.
Step two: set the stream in the second link to the response from the GET operation in the first link, and you will upload the stream (arrayBuffer) from the first fetch to the destination bucket.

How to stream a very large file to Dropbox using the Python v2 API

Background
I finally convinced someone to share his full archival-node database (5868 GiB) for free. It now has to be built in RAM, which requires about $100,000 worth of RAM, but once built it can be run from an SSD.
However, he is only willing to send it as a single tar file over raw TCP, using a rather slow (400 Mbps) connection for this task.
I need to get it onto Dropbox, and he doesn't want to use the https://www.dropbox.com/request/[my upload key here] page, which allows uploading files through a web browser without a Dropbox account (it really annoyed him that I suggested another method, or compressing the database, to the point that he is on the verge of changing his mind about sharing it).
On my side, Dropbox allows using 10 TiB of storage for free for 30 days, and I haven't received the required SSD yet (once it arrives I will be able to download the data at a faster speed).
The problem
I'm fully aware of upload file to my dropbox from python script, but in my case the file doesn't fit in a memory buffer, nor even on disk.
Previously, in API v1, it wasn't possible to append data to an existing file (and I didn't find the answer for v2).
To upload a large file to the Dropbox API using the Dropbox Python SDK, you would use upload sessions to upload it in pieces. There's a basic example here.
Note that the Dropbox API only supports files up to 350 GB though.
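A minimal sketch of that with the Dropbox Python SDK (access token, paths, and chunk size are placeholders; it assumes the source is larger than one chunk): start an upload session, append chunks, then finish with a commit.

import os
import dropbox

dbx = dropbox.Dropbox('ACCESS_TOKEN')   # placeholder token
local_path = '/data/archive.tar'        # placeholder source
dest_path = '/archive.tar'              # destination path in Dropbox
CHUNK = 64 * 1024 * 1024                # each request must stay well under ~150 MB

file_size = os.path.getsize(local_path)
with open(local_path, 'rb') as f:
    session = dbx.files_upload_session_start(f.read(CHUNK))
    cursor = dropbox.files.UploadSessionCursor(session_id=session.session_id,
                                               offset=f.tell())
    commit = dropbox.files.CommitInfo(path=dest_path)
    while f.tell() < file_size:
        if file_size - f.tell() <= CHUNK:
            dbx.files_upload_session_finish(f.read(CHUNK), cursor, commit)
        else:
            dbx.files_upload_session_append_v2(f.read(CHUNK), cursor)
            cursor.offset = f.tell()

Since the source here arrives over raw TCP rather than from a local file, you would read chunks from the socket instead, but the session/append/finish pattern is the same.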

Boto3 put_object() is very slow

TL;DR: Trying to put .json files into an S3 bucket using Boto3; the process is very slow. Looking for ways to speed it up.
This is my first question on SO, so I apologize if I leave out any important details. Essentially I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I am scrolling, I am processing the data and putting it in the bucket as a [timestamp].json format, using this:
s3 = boto3.resource('s3')
data = '{"some":"json","test":"data"}'
key = "path/to/my/file/[timestamp].json"
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
While running this on my machine, I noticed that this process is very slow. Using line profiler, I discovered that this line is consuming over 96% of the time in my entire program:
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
What modification(s) can I make in order to speed up this process? Keep in mind, I am creating the .json files in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading the files. Thanks in advance.
Since you are potentially uploading many small files, you should consider a few items:
Some form of threading/multiprocessing; for example, see How to upload small files to Amazon S3 efficiently in Python (a sketch follows this list)
Creating some form of archive file (ZIP) containing sets of your small data blocks and uploading them as larger files. This is of course dependent on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods instead as they will handle multi-part upload and threading.
S3 performance implications as described in Request Rate and Performance Considerations
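A sketch of the threading item above (bucket name and payloads are placeholders): boto3's low-level client is thread-safe, and a thread pool amortizes the per-request latency that dominates ~240-byte puts.

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')  # low-level clients are thread-safe; resources are not

def put_doc(item):
    key, body = item
    s3.put_object(Bucket='my_bucket', Key=key, Body=body)

docs = [('path/to/my/file/0.json', '{"some":"json","test":"data"}'),
        ('path/to/my/file/1.json', '{"some":"json","test":"data"}')]

with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(put_doc, docs))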

What is the best practice for writing text output into an S3 bucket?

My pipeline (Python) writes text data that it reads from BigQuery.
As far as I know, I have two options for writing the text data into S3.
The first option is for a "Writer" subclass of a custom Sink to write each record into the S3 bucket directory.
In my experience the transfer efficiency of this is very low.
The Writer spends about a second per record (and my data source has millions of records!).
The second option is to write the text data into GCS first and then transfer it to S3.
This option seems inefficient to me.
The reason is that unnecessary traffic (upload/download) occurs between GCS and Dataflow.
(My pipeline does not need to store the text data in GCS.)
Is there a better way to write into S3 than these two options?
Regards.
The first approach of writing a custom sink for S3 seems good. You could use a buffer to batch upload writes to S3 instead of writing a file per record. If your buffer is not huge then you can directly upload to s3 otherwise using the multipart upload API would be a good alternative as well. Code in gcsio might be useful here.
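As an illustration of the buffering idea (bucket name, key pattern, and batch size are made up), accumulate records and write one object per batch instead of one object per record:

import boto3

s3 = boto3.client('s3')
BUCKET = 'my-output-bucket'                       # placeholder
BATCH = 10000                                     # records per S3 object

records = (f'record {i}' for i in range(25000))   # stand-in for the BigQuery output
buffer, part = [], 0
for record in records:
    buffer.append(record)
    if len(buffer) >= BATCH:
        s3.put_object(Bucket=BUCKET, Key=f'output/part-{part:05d}.txt',
                      Body='\n'.join(buffer).encode('utf-8'))
        buffer, part = [], part + 1
if buffer:  # flush the final partial batch
    s3.put_object(Bucket=BUCKET, Key=f'output/part-{part:05d}.txt',
                  Body='\n'.join(buffer).encode('utf-8'))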
In the second case you can directly use the TextSink to write to GCS but you'll have to move the files from GCS to S3 somehow later if the data needs to live in s3 at the end.
I have also created https://issues.apache.org/jira/browse/BEAM-994 to track the need for S3 support.

Given an archive_id, how might I go about moving an archive from AWS Glacier to an S3 Bucket?

I have written an archival system with Python Boto that tars several directories of files and uploads them to Glacier. This is all working great, and I am storing all of the archive IDs.
I wanted to test downloading a large archive (about 120 GB). I initiated the retrieval, but the download took more than 24 hours, and at the end I got a 403 because the resource was no longer available and the download failed.
If I archived straight from my server to Glacier (skipping S3), is it possible to initiate a restore that restores an archive to an S3 bucket, so I can take longer than 24 hours to download a copy? I didn't see anything in either the S3 or Glacier Boto docs.
Ideally I'd do this with Boto, but I'm open to other scriptable options. Does anyone know how, given an archiveId, I might go about moving an archive from AWS Glacier to an S3 bucket? If this is not possible, are there other options to give myself more time to download large files?
Thanks!
http://docs.pythonboto.org/en/latest/ref/glacier.html
http://docs.pythonboto.org/en/latest/ref/s3.html
The direct Glacier API and the S3/Glacier integration are not connected to each other in a way that is accessible to AWS users.
If you upload directly to Glacier, the only way to get the data back is to fetch it back directly from Glacier.
Conversely, if you add content to Glacier via S3 lifecycle policies, then there is no exposed Glacier archive ID, and the only way to get the content is to do an S3 restore.
It's essentially as if "you" aren't the Glacier customer, but rather "S3" is the Glacier customer, when you use the Glacier/S3 integration. (In fact, that's a pretty good mental model -- the Glacier storage charges are even billed differently -- files stored through the S3 integration are billed together with the other S3 charges on the monthly invoice, not with the Glacier charges).
The way to accomplish what you're actually trying to do is to use range retrievals, where you only request that Glacier restore a portion of the archive.
Another reason you could choose to perform a range retrieval is to manage how much data you download from Amazon Glacier in a given period. When data is retrieved from Amazon Glacier, a retrieval job is first initiated, which will typically complete in 3-5 hours. The data retrieved is then available for download for 24 hours. You could therefore retrieve an archive in parts in order to manage the schedule of your downloads. You may also choose to perform range retrievals in order to reduce or eliminate your retrieval fees.
— http://aws.amazon.com/glacier/faqs/
You'd then need to reassemble the pieces. That last part seems like a big advantage also, since Glacier does charge more, the more data you "restore" at a time. Note this isn't a charge for downloading the data, it's a charge for the restore operation, whether you download it or not.
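With boto3's low-level Glacier client, a range retrieval looks roughly like this (vault name and archive ID are placeholders; byte ranges must be megabyte-aligned):

import boto3

glacier = boto3.client('glacier')
VAULT = 'my-vault'                     # placeholder
ARCHIVE_ID = 'the-stored-archive-id'   # placeholder
GIB = 1024 ** 3

# Ask Glacier to restore only the first 32 GiB; repeat with later
# ranges on whatever schedule suits your download window.
job = glacier.initiate_job(
    vaultName=VAULT,
    jobParameters={
        'Type': 'archive-retrieval',
        'ArchiveId': ARCHIVE_ID,
        'RetrievalByteRange': f'0-{32 * GIB - 1}',
    },
)

# Hours later, once the job has completed, download the restored range.
out = glacier.get_job_output(vaultName=VAULT, jobId=job['jobId'])
with open('archive.part0', 'wb') as f:
    for chunk in iter(lambda: out['body'].read(8 * 1024 * 1024), b''):
        f.write(chunk)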
One advantage I see of the S3 integration is that you can leave your data "cooling off" in S3 for a few hours/days/weeks before you put it "on ice" in Glacier, which happens automatically... so you can fetch it back from S3 without paying a retrieval charge, until it's been sitting in S3 for the amount of time you've specified, after which it automatically migrates. The potential downside is that it seems to introduce more moving parts.
Using lifecycle policies you can move files directly from S3 to Glacier, and you can also restore those objects back to S3 using the restore method of the boto.s3.Key object. Also, see this section of the S3 docs for more information on how restore works.
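The restore call in classic boto looks roughly like this (bucket and key names are placeholders); the restored copy stays readable in S3 for the number of days requested, after which only the Glacier copy remains:

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my-archive-bucket')      # placeholder
key = bucket.get_key('backups/archive-2014.tar')   # placeholder
key.restore(days=7)  # ask S3 to restore the Glacier-archived object for 7 days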
