I have a text file on S3 with around 300 million lines. I'm looking to split this file into smaller files of 1,000 lines each (with the last file containing the remainder), which I'd then like to put into another folder or bucket on S3.
So far, I've been running this on my local drive using the Linux command:
split -l 1000 file
which splits the original file into smaller files of 1,000 lines each. However, with a file this large, it seems inefficient to download it, split it locally, and then re-upload the pieces from my local drive back up to S3.
What would be the most efficient way to split this S3 file, ideally using Python (in a Lambda function) or using other S3 commands? Is it faster to just run this on my local drive?
Anything that you do will have to download the file, split it, and re-upload it. The only question is where, and whether local disk is involved.
John Rotenstein gave you an example using local disk on an EC2 instance. This has the benefit of running in the AWS datacenters, so it gets a high-speed connection, but has the limitations that (1) you need disk space to store the original file and its pieces, and (2) you need an EC2 instance where you can do this.
One small optimization is to avoid the local copy of the big file, by using a hyphen as the destination of the s3 cp: this will send the output to standard out, and you can then pipe it into split (here I'm also using a hyphen to tell split to read from standard input):
aws s3 cp s3://my-bucket/big-file.txt - | split -l 1000 - output.
aws s3 cp . s3://dest-bucket/ --recursive --exclude "*" --include "output.*"
Again, this requires an EC2 instance to run it on, and the storage space for the output files. There is, however, a flag to split that will let you run a shell command for each file in the split:
aws s3 cp s3://src-bucket/src-file - | split -l 1000 --filter 'aws s3 cp - s3://dst-bucket/result.$FILE' -
So now you've eliminated the issue of local storage, but are left with the issue of where to run it. My recommendation would be AWS Batch, which can spin up an EC2 instance for just the time needed to perform the command.
You can, of course, write a Python script to do this on Lambda, and that would have the benefit of being triggered automatically when the source file has been uploaded to S3. I'm not that familiar with the Python SDK (boto), but it appears that get_object will return the original file's body as a stream of bytes, which you can then iterate over as lines, accumulating however many lines you want into each output file.
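For example, here is a minimal boto3 sketch of that idea (bucket names, key names, and the part-naming scheme are placeholders, and a single Lambda invocation may still run into the 15-minute timeout on a 300-million-line file):

import boto3

s3 = boto3.client("s3")

def split_object(src_bucket="src-bucket", src_key="big-file.txt",
                 dst_bucket="dst-bucket", lines_per_file=1000):
    # Stream the source object; iter_lines() avoids loading all 300M lines at once
    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"]
    chunk, part = [], 0
    for line in body.iter_lines():
        chunk.append(line)
        if len(chunk) == lines_per_file:
            s3.put_object(Bucket=dst_bucket,
                          Key=f"{src_key}.part{part:06d}",
                          Body=b"\n".join(chunk) + b"\n")
            chunk, part = [], part + 1
    if chunk:  # the remainder goes into one last, smaller file
        s3.put_object(Bucket=dst_bucket,
                      Key=f"{src_key}.part{part:06d}",
                      Body=b"\n".join(chunk) + b"\n")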
Your method seems sound (download, split, upload).
You should run the commands from an Amazon EC2 instance in the same region as the Amazon S3 bucket.
Use the AWS Command-Line Interface (CLI) to download/upload the files:
aws s3 cp s3://my-bucket/big-file.txt .
aws s3 cp --recursive folder-with-files s3://my-bucket/destination-folder/
Related
We need to move our video file storage to AWS S3. The old location is a CDN, so I only have a URL for each file (1000+ files, > 1 TB total file size). Running an upload tool directly on the storage server is not an option.
I already created a tool that downloads each file, uploads it to the S3 bucket, and updates the DB record with the new HTTP URL. It works perfectly, except that it takes forever.
Downloading a file takes some time (each file is close to a gigabyte) and uploading it takes longer.
Is it possible to upload the video files directly from the CDN to S3, so I could cut the processing time in half? Something like reading a chunk of the file and putting it to S3 while reading the next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: I have no problem with internet speed; I run the app on a server with a 1 Gbit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream"; you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this with threads or async I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
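To make that concrete, here is a hedged Python sketch of the range-request variant (the question uses C#, so treat this purely as an illustration; the URL, bucket, and key are placeholders, and it assumes the origin returns Content-Length and honors Range headers):

import boto3
import requests
from concurrent.futures import ThreadPoolExecutor

SRC_URL = "https://cdn.example.com/video.mp4"    # hypothetical source URL
BUCKET, KEY = "dst-bucket", "video.mp4"          # hypothetical destination
PART_SIZE = 8 * 1024 * 1024                      # parts must be >= 5 MB (except the last)

s3 = boto3.client("s3")

def copy_url_to_s3():
    # Assumes the CDN reports Content-Length and supports Range requests
    size = int(requests.head(SRC_URL, allow_redirects=True).headers["Content-Length"])
    upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

    def copy_part(part_no, start):
        end = min(start + PART_SIZE, size) - 1
        data = requests.get(SRC_URL, headers={"Range": f"bytes={start}-{end}"}).content
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                              PartNumber=part_no, Body=data)
        return {"PartNumber": part_no, "ETag": resp["ETag"]}

    starts = range(0, size, PART_SIZE)
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map() preserves order, so the parts list stays sorted by PartNumber
        parts = list(pool.map(copy_part, range(1, len(starts) + 1), starts))

    s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                                 MultipartUpload={"Parts": parts})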
I've answered this in another question; here's the gist:
require 'aws-sdk-s3'
require 'open-uri'

object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
  IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is not a 'direct' pull into S3, though. But at least it doesn't download each file and then upload it in a separate step; it streams the data 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and about as fast as a direct pull would ever be.
If a proxy (Node/Express) is suitable for you, then the portions of code at these two routes could be combined into a GET/POST fetch chain, retrieving and then re-posting the response body to your destination S3 bucket.
Step one produces response.body.
Step two: set the stream in the second link to the response from the GET operation in the first link, and you will upload the stream (arrayBuffer) from the first fetch to the destination bucket.
I need to clone an entire bucket via the AWS SDK for Python (boto3). But my bucket has more than 5 million objects, so calling objects_response = client.list_objects_v2(Bucket=bucket_name) page by page and then copying each object takes too much time, and it isn't safe: if the process fails, it has to start over with that many files. How can I do this in a faster and more reliable way?
AWS CLI s3 sync
The AWS Command-Line Interface (CLI) aws s3 sync command can copy S3 objects in parallel.
You can adjust the settings to enable more simultaneous copies via the AWS CLI S3 Configuration file.
The sync command uses CopyObject() to copy objects, which tells S3 to copy the objects between buckets. Therefore, no data is downloaded/uploaded -- it just sends commands to S3 to manage the copy.
Running the sync command from an Amazon EC2 instance will reduce network latency, resulting in a faster copy (especially when there are many smaller objects).
You could improve copy speed by running several copies of aws s3 sync (preferably from multiple computers). For example, each could be responsible for copying a separate sub-directory.
See also: Improve transfer performance of sync command in Amazon S3
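If you would rather stay in boto3, as in the original question, the same server-side idea looks roughly like the sketch below: paginate the listing and issue CopyObject calls from a thread pool, so no object data passes through your machine. Bucket names are placeholders, and note that CopyObject only handles objects up to 5 GB (larger objects need a multipart copy):

import boto3
from concurrent.futures import ThreadPoolExecutor

SRC, DST = "source-bucket", "destination-bucket"   # placeholders
s3 = boto3.client("s3")

def copy_key(key):
    # Server-side copy: S3 moves the bytes, nothing is downloaded locally
    s3.copy_object(Bucket=DST, Key=key, CopySource={"Bucket": SRC, "Key": key})

paginator = s3.get_paginator("list_objects_v2")
with ThreadPoolExecutor(max_workers=32) as pool:
    for page in paginator.paginate(Bucket=SRC):
        keys = [obj["Key"] for obj in page.get("Contents", [])]
        list(pool.map(copy_key, keys))   # list() surfaces any exceptions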
Amazon S3 Batch Operations
Amazon S3 itself can also perform a Copy operation on a large number of files by using Amazon S3 Batch Operations:
First, create an Amazon S3 Inventory report that lists all the objects (or supply a CSV file with object names)
Then, create a S3 Batch Operations "Copy" job, pointing to the destination bucket
The entire process will be managed by Amazon S3.
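For reference, creating such a copy job with boto3's s3control client looks roughly like the sketch below; the account ID, role, ETag, and ARNs are all placeholders, and the exact parameters should be checked against the current API documentation:

import boto3

s3control = boto3.client("s3control")

response = s3control.create_job(
    AccountId="111122223333",                                    # placeholder account
    ConfirmationRequired=True,
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/batch-copy-role",    # role S3 Batch Ops assumes
    Operation={
        "S3PutObjectCopy": {
            "TargetResource": "arn:aws:s3:::destination-bucket",
        }
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::manifest-bucket/manifest.csv",
            "ETag": "manifest-etag-goes-here",                   # placeholder ETag
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::report-bucket",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "ReportScope": "FailedTasksOnly",
    },
)
print(response["JobId"])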
I'm getting files into an S3 location on a weekly basis, and after processing I need to move these files to another S3 location to archive them. I have Cloudera NiFi hosted on AWS. I can't use PutS3Object + DeleteS3Object processors at the end of the flow, because in this NiFi process I'm decompressing the file and adding an additional column (re-compressing the file and dropping the column would hurt performance). I need a Python/Groovy script to move files from the S3 location. Is there any other way to do this?
I need a Python/Groovy script to move files from the S3 location. Is there any other way to do this?
No, you don't. You can use the record processors, or a scripted processor, to update the files and push them back to S3. We pull, mutate, and re-upload data like this all the time without having to control the upload with an external script.
The gsutil cp command has the -I option to copy multiple files using a list of file names from stdin. Is there a way to do the same in Python, preferably using a library like the official Google Storage client or gcsfs/fsspec? Or is it only possible to iterate over all the file names and copy each one?
Using the version currently in gcsfs master (to be released soon), you can copy files to GCS from memory or local files with a list:
gcs.pipe({path1: content1, path2: content2}) # for in-memory bytes
gcs.put([local_path1, local_path2], [remote_path1, remote_path2]) # files
For the latter, you can give just one remote path, which will be assumed to be a directory, and the remote files will get the same basenames as local.
The calls will be processed concurrently, which may be much faster than sequential uploads, especially for small transfers.
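As a small end-to-end illustration (assuming a gcsfs release that includes this list support; the project, bucket, and file names are placeholders):

import gcsfs

# Placeholders: project and bucket names are examples
gcs = gcsfs.GCSFileSystem(project="my-project")

local_files = ["data/part-0001.csv", "data/part-0002.csv"]
remote_paths = ["my-bucket/incoming/part-0001.csv", "my-bucket/incoming/part-0002.csv"]

# Uploads run concurrently under the hood
gcs.put(local_files, remote_paths)

# Or give a single remote "directory" and keep the local basenames
gcs.put(local_files, "my-bucket/incoming/")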
I have a Lambda function that does some processing and generates a file. I have some append operations on this file, so I write it to the /tmp folder; after the processing I upload it to S3. Sometimes the file size is bigger than 512 MB, so the function fails. Is there any method to write the file directly to S3? S3 does not support appending. I'm using Python in the Lambda.
After a lot of searching: there is a Python package, smart_open, that allows you to write directly to S3:
smart_open
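A minimal sketch of that approach, assuming the Lambda's role can write to the destination bucket (the bucket, key, and generate_rows function are placeholders):

from smart_open import open as s3_open

def generate_rows():
    # Stand-in for the function's real append logic
    for i in range(3):
        yield f"row {i}"

# smart_open streams parts to S3 as you write, so the file never has to fit in /tmp
with s3_open("s3://my-bucket/output/result.csv", "w") as fout:
    for row in generate_rows():
        fout.write(row + "\n")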
As described here, the storage of the /tmp folder cannot exceed 512 MB. It would be better to move this logic out of the Lambda; at the end of your process, you just need to upload the file to S3 using boto3.
But if you are working with image files, you can use Pillow to reduce the file size first.