I am trying to upload a 400 GB .ibd file from a MySQL database machine into D42/S3.
I am using the set_contents_from_file function of Python boto, but it is taking a lot of time and I cannot see any progress (how much has been uploaded / how much is left).
Does anyone have a Python script for a threaded or parallel multipart upload? It's a very simple use case for an end user, but boto's documentation doesn't offer anything like this.
In the end I did it with 's3cmd' and not with Python.
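For anyone landing here later: boto3 (boto's successor) handles this out of the box. A minimal sketch of a threaded multipart upload with a progress callback, using hypothetical file/bucket/key names:

import os
import threading

import boto3
from boto3.s3.transfer import TransferConfig

class Progress:
    """Prints bytes uploaded vs. total as the transfer proceeds."""
    def __init__(self, filename):
        self._size = os.path.getsize(filename)
        self._seen = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        with self._lock:
            self._seen += bytes_amount
            print(f"{self._seen} / {self._size} bytes "
                  f"({100 * self._seen / self._size:.1f}%)", end="\r")

# Multipart kicks in above the threshold; parts are uploaded by a thread pool.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
    use_threads=True,
)

s3 = boto3.client("s3")
s3.upload_file(
    "mydb.ibd",          # hypothetical local file
    "my-bucket",         # hypothetical bucket
    "backups/mydb.ibd",  # hypothetical key
    Config=config,
    Callback=Progress("mydb.ibd"),
)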
I'm operating in a factory setting where speed is important. I store order information in the cloud. Barcodes are printed by querying the database for information. Operators use a tkinter app on a Raspberry Pi that runs a Python script to query the cloud.
Currently, printing a barcode takes about 5 seconds: making that query and then using os.system() to print out the barcode.
Is there a faster way to send jobs to the printer?
I've been looking into storing files locally to speed this process up; does anyone have any ideas of what to look into? Network-attached storage that downloads the relevant files from the cloud nightly?
Any suggestions for running modern factory automation with python?
Check out subprocess.Popen(), which is a more flexible (and lighter-weight) alternative to os.system(); a minimal sketch is below.
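This sketch assumes the Pi prints through CUPS and that the lp command is available (both assumptions, not stated in the question):

import subprocess

def print_barcode(path):
    # Hand the file to the print spooler without blocking the tkinter UI;
    # lp returns as soon as the job is queued.
    proc = subprocess.Popen(
        ["lp", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc  # the caller can poll() or communicate() later if needed

# usage with a hypothetical file name:
# print_barcode("/tmp/order_12345.png")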
I am trying to migrate an AWS Lambda function written in Python to Cloud Functions (CF) that:
unzips on the fly and reads line by line
performs some light transformations on each line
writes output (a line at a time or in chunks) uncompressed to GCS
The output is > 2 GB but slightly less than 3 GB, so it just fits in Lambda.
Well, it seems impossible or way more involved in GCP:
the uncompressed output cannot fit in memory or /tmp (limited to 2048 MB as of this writing), so the Python client library's upload_from_file (or upload_from_filename) cannot be used
there is an official paper, but to my surprise it refers to boto, a library initially designed for AWS S3 and quite outdated now that boto3 has been out for some time. There is no genuine GCP method to stream writes or reads
Node.js has a simple createWriteStream() (nice article here, btw), but there is no equivalent one-liner in Python
resumable media upload sounds like the right thing, but it is a lot of code for something Node handles much more easily
App Engine had cloudstorage, but it is not available outside App Engine, and it is obsolete
there are little to no examples out there of a working wrapper for writing text/plain data line by line as if GCS were a local filesystem. This is not limited to Cloud Functions and is a missing feature of the Python client library, but it is more acute in CF due to the resource constraints. Btw, I was part of a discussion to add a writeable IOBase function, but it got no traction.
obviously, using a VM or Dataflow is out of the question for the task at hand.
In my mind, stream (or stream-like) reading/writing from cloud-based storage should even be included in the Python standard library.
As recommended back then, one can still use GCSFS, which behind the scenes commits the upload in chunks for you while you write to a file object (minimal sketch below).
The same team wrote s3fs. I don't know about Azure.
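A minimal sketch of the GCSFS route, with hypothetical project/bucket/object names:

import gcsfs  # pip install gcsfs

fs = gcsfs.GCSFileSystem(project="my-project")

# gcsfs buffers the writes and commits them to GCS in chunks behind the scenes,
# so the whole output never has to sit in memory or /tmp.
with fs.open("my-bucket/output/lines.txt", "w") as fout:
    for i in range(1000):
        fout.write(f"transformed line {i}\n")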
AFAIC, I will stick to AWS Lambda since the output fits in memory, for now, but multipart upload is the way to go to support any output size with minimal memory.
Thoughts or alternatives?
I got confused with multipart vs. resumable upload. The latter is what you need for "streaming" - it's actually more like uploading chunks of a buffered stream.
Multipart upload is to load data and custom metadata at once, in the same API call.
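For illustration only (not part of the original answer): with a recent google-cloud-storage client, chunked resumable writing looks roughly like this; Blob.open and the names below are assumptions on my part.

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("output/lines.txt")  # hypothetical names

# The file-like writer buffers data and sends it to GCS one chunk at a time
# via a resumable upload, so the full object never sits in memory.
with blob.open("w", chunk_size=8 * 1024 * 1024) as fout:
    for i in range(1000):
        fout.write(f"line {i}\n")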
While I like GCSFS very much (Martin, its main contributor, is very responsive), I recently found an alternative that uses the google-resumable-media library.
GCSFS is built upon the core HTTP API, whereas Seth's solution uses a low-level library maintained by Google that is more in sync with API changes and includes exponential backoff. The latter is really a must for large/long streams, as the connection may drop even within GCP; we faced this issue with GCF.
On a closing note, I still believe that the Google Cloud Library is the right place to add stream-like functionality, with basic write and read. It has the core code already.
If you too are interested in that feature in the core lib, give the issue here a thumbs up, assuming priority is based on that.
smart_open now has support for GCS, as well as for on-the-fly decompression.
import lzma
from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

# stream from GCS
with open('gs://my_bucket/my_file.txt.xz') as fin:
    for line in fin:
        print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt.xz', 'wb') as fout:
    fout.write(b'hello world')
I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
The easier option would be to load the GCS data into BigQuery and run your query from there.
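A minimal sketch of that approach with the Python client, assuming each file is a single JSON object on one line and using hypothetical bucket/field names; the _FILE_NAME pseudo-column of an external table gives you back the source file:

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Expose the GCS files as an external table; no copy into BigQuery is needed.
ext = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
ext.source_uris = ["gs://my-bucket/docs/*.json"]  # hypothetical bucket/path
ext.autodetect = True

job_config = bigquery.QueryJobConfig(table_definitions={"docs": ext})

# 'text_field' is a placeholder for the field that holds the text to search.
sql = """
    SELECT DISTINCT _FILE_NAME AS filename
    FROM docs
    WHERE STRPOS(text_field, 'my substring') > 0
"""
for row in client.query(sql, job_config=job_config).result():
    print(row.filename)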
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE, install Presto on it with a lot of workers, use a Hive metastore with GCS, and query from there. (Presto doesn't have a direct GCS connector yet, AFAIK.) This option seems more elaborate.
Hope it helps!
TL;DR: Trying to put .json files into an S3 bucket using Boto3; the process is very slow. Looking for ways to speed it up.
This is my first question on SO, so I apologize if I leave out any important details. Essentially I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I am scrolling, I am processing the data and putting it in the bucket as [timestamp].json, using this:
s3 = boto3.resource('s3')
data = '{"some":"json","test":"data"}'
key = "path/to/my/file/[timestamp].json"
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
While running this on my machine, I noticed that this process is very slow. Using line profiler, I discovered that this line is consuming over 96% of the time in my entire program:
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
What modification(s) can I make in order to speed up this process? Keep in mind, I am creating the .json files in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading the files. Thanks in advance.
Since you are potentially uploading many small files, you should consider a few items:
Some form of threading/multiprocessing. For example, see How to upload small files to Amazon S3 efficiently in Python; a minimal threaded sketch follows this list.
Creating some form of archive file (ZIP) containing sets of your small data blocks and uploading them as larger files. This is of course dependent on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods instead as they will handle multi-part upload and threading.
S3 performance implications as described in Request Rate and Performance Considerations
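A minimal sketch of the threading option mentioned above, assuming the same bucket/key layout as in the question:

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")  # the low-level client is thread-safe
BUCKET = "my_bucket"     # hypothetical bucket, as in the question

def put_doc(item):
    key, body = item
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)

def upload_all(docs, workers=32):
    # docs is an iterable of (key, json_string) pairs produced while scrolling ES.
    # Running the puts concurrently hides the per-request latency that dominates
    # sequential put_object calls for tiny objects.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(put_doc, docs))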
I am having a problem where large file (over 1 GB) uploads to S3 get stuck when using boto's set_contents_from_filename.
I have tried using set_contents_from_file instead, and I am getting the same thing.
I am using the cb argument on both functions to call a callback function while uploading, which tells me how the upload is progressing. I see that a 1 GB file gets stuck somewhere around 800 MB.
EDIT: It seems that this function has a memory leak, as described here:
boto set_contents_from_filename memory leak
It seems you are trying to use boto, which is slowly becoming obsolete.
In the long term, changing to boto3 is inevitable, as the older boto is not really maintained. See boto3 is not a replacement of boto (yet?)
You may find example of uploading files here:
https://stackoverflow.com/a/29636604/346478
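For completeness, a minimal sketch of the boto3 equivalent of set_contents_from_filename (hypothetical names); boto3's managed transfer streams the file in parts instead of holding it all in memory:

import boto3

s3 = boto3.resource("s3")
# upload_file uses boto3's managed transfer: the file is read and sent in
# parts, so a >1 GB upload does not have to be buffered in memory.
s3.Bucket("my-bucket").upload_file("big_file.bin", "uploads/big_file.bin")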