I am trying to migrate an AWS Lambda function written in Python to Google Cloud Functions (CF). The function:
unzips on the fly and reads line by line
performs some light transformations on each line
writes the output (a line at a time or in chunks), uncompressed, to GCS
The output is over 2 GB, but slightly less than 3 GB, so it just fits in Lambda.
Well, it seems impossible, or at least far more involved, in GCP:
the uncompressed output cannot fit in memory or in /tmp (limited to 2048 MB as of this writing), so the Python client library's upload_from_file (or upload_from_filename) cannot be used
there is this official paper, but to my surprise it refers to boto, a library originally designed for AWS S3 and quite outdated now that boto3 has been out for some time. There is no genuine GCP method to stream-write or stream-read
Node.js has a simple createWriteStream() (nice article here, btw), but there is no equivalent one-liner in Python
resumable media upload sounds like it, but it is a lot of code for something Node handles much more easily
App Engine had cloudstorage, but it is not available outside App Engine, and it is obsolete
there is little to no example out there of a working wrapper for writing text/plain data line by line as if GCS were a local filesystem. This is not limited to Cloud Functions, and it is a missing feature of the Python client library, but it is more acute in CF due to the resource constraints. Btw, I was part of a discussion to add a writeable IOBase function, but it got no traction
obviously, using a VM or Dataflow is out of the question for the task at hand
In my mind, stream (or stream-like) reading/writing from cloud-based storage should even be included in the Python standard library.
As recommended back then, one can still use GCSFS, which, behind the scenes, commits the upload in chunks for you while you are writing to a file object.
The same team wrote s3fs. I don't know about Azure.
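For illustration, a minimal sketch of the GCSFS approach (the project, bucket, path and generate_lines() helper are placeholders I made up): gcsfs buffers your writes and commits them to GCS in chunks, so the full output never has to sit in memory or /tmp.

import gcsfs

fs = gcsfs.GCSFileSystem(project="my-project")
with fs.open("my-bucket/output.txt", "wb") as fout:
    for line in generate_lines():          # hypothetical source of transformed lines
        fout.write(line.encode("utf-8") + b"\n")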
AFAIC, I will stick to AWS Lambda, as the output can fit in memory - for now - but multipart upload is the way to go to support any output size with minimal memory.
Thoughts or alternatives?
I got confused with multipart vs. resumable upload. The latter is what you need for "streaming" - it's actually more like uploading chunks of a buffered stream.
Multipart upload is to load data and custom metadata at once, in the same API call.
While I like GCSFS very much (Martin, its main contributor, is very responsive), I recently found an alternative that uses the google-resumable-media library.
GCSFS is built upon the core HTTP API, whereas Seth's solution uses a low-level library maintained by Google, which is more in sync with API changes and includes exponential backoff. The latter is really a must for large/long streams, as the connection may drop, even within GCP - we faced the issue with GCF.
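For context, here is a rough sketch of what a resumable upload with google-resumable-media looks like (bucket, object name and scopes are placeholders; retries and error handling are omitted):

import io

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.resumable_media.requests import ResumableUpload

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/devstorage.read_write"])
transport = AuthorizedSession(credentials)

# Chunk size must be a multiple of 256 KiB.
chunk_size = 256 * 1024
stream = io.BytesIO(b"line 1\nline 2\n")   # in practice, a buffer you keep refilling

url = ("https://www.googleapis.com/upload/storage/v1/"
       "b/my-bucket/o?uploadType=resumable")
upload = ResumableUpload(url, chunk_size)
upload.initiate(transport, stream, {"name": "my_file.txt"}, "text/plain")

while not upload.finished:
    upload.transmit_next_chunk(transport)

It is noticeably more code than the GCSFS one-liner, which is exactly the point made above.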
On a closing note, I still believe that the Google Cloud Library is the right place to add stream-like functionality, with basic write and read. It has the core code already.
If you too are interested in that feature in the core lib, thumbs up the issue here - assuming priority is based on that.
smart_open now has support for GCS, as well as support for on-the-fly decompression.
import lzma
from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

# stream from GCS
with open('gs://my_bucket/my_file.txt.xz') as fin:
    for line in fin:
        print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt.xz', 'wb') as fout:
    fout.write(b'hello world')
Background
I finally convinced someone willing to share his full archival-node database (5868 GiB) for free (it now has to be built in RAM, which requires about $100,000 worth of RAM, but it can be run from an SSD once built).
However, he only wants to send it as a single tar file over raw TCP, using a rather slow (400 Mbps) connection for this task.
I need to get it onto Dropbox, and he doesn't want to use the https://www.dropbox.com/request/[my upload key here] link that allows uploading files through a web browser without a Dropbox account (it annoyed him so much that I even mentioned using another method, or compressing the database, that he is on the verge of changing his mind about sharing it).
This matters because, on my side, Dropbox allows 10 TiB of storage for free for 30 days, and I haven't received the required SSD yet (once it arrives, I will be able to download the file at a faster speed).
The problem
I am fully aware of "upload file to my dropbox from python script", but in my case the file doesn't fit in a memory buffer, nor even on disk.
And in API v1 it wasn't possible to append data to an existing file (I didn't find the answer for v2).
To upload a large file to the Dropbox API using the Dropbox Python SDK, you would use upload sessions to upload it in pieces. There's a basic example here.
Note that the Dropbox API only supports files up to 350 GB though.
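A hedged sketch of what the upload-session flow looks like with the Python SDK (token, paths and chunk size are placeholders, and it assumes the file is larger than one chunk):

import os

import dropbox

CHUNK = 64 * 1024 * 1024                     # stay under the ~150 MB per-call limit
dbx = dropbox.Dropbox("MY_ACCESS_TOKEN")

local_path = "archive.tar"
dest_path = "/archive.tar"
size = os.path.getsize(local_path)

with open(local_path, "rb") as f:
    session = dbx.files_upload_session_start(f.read(CHUNK))
    cursor = dropbox.files.UploadSessionCursor(
        session_id=session.session_id, offset=f.tell())
    commit = dropbox.files.CommitInfo(path=dest_path)

    while f.tell() < size:
        if (size - f.tell()) <= CHUNK:
            dbx.files_upload_session_finish(f.read(CHUNK), cursor, commit)
        else:
            dbx.files_upload_session_append_v2(f.read(CHUNK), cursor)
            cursor.offset = f.tell()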
One CSV file is uploaded to Cloud Storage every day around 0200 hrs, but sometimes, due to a job failure or a system crash, the upload happens very late. So I want to create a Cloud Function that can trigger my Python BQ load script whenever the file is uploaded to the bucket.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks a detailed description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches that you might choose from, depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
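As a rough sketch (the dataset, table and file prefix are assumptions based on the question), a storage-triggered function that kicks off a BigQuery load job could look like this:

from google.cloud import bigquery

def gcs_to_bq(event, context):
    """Triggered by google.storage.object.finalize on the bucket."""
    bucket = event["bucket"]
    name = event["name"]

    # Only react to the daily seller file, e.g. seller_data_2021-06-22.csv
    if not name.startswith("seller_data_"):
        return

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        "gs://{}/{}".format(bucket, name),
        "my_dataset.seller_data",
        job_config=job_config,
    )
    load_job.result()   # wait for the load to finish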
The hard way: App Engine with a few tricks.
Host a basic Flask application on GAE (Standard or Flex), with an endpoint dedicated to checking whether the file exists, downloading the object, manipulating it, and then doing something with it.
This route acts as a custom HTTP-triggered function: the request could come from a simple curl call, a visit from the browser, a Pub/Sub event, or even another Cloud Function.
Once it receives a GET (or POST) request, it downloads the object into the /tmp dir, processes it, and then does something with it.
The small benefit of GAE over CF is that you can keep a minimum of one instance always alive, which means you avoid cold starts and the risk of the request timing out before the job is done.
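A very small sketch of that endpoint (the route, bucket and object names are just examples):

from flask import Flask
from google.cloud import storage

app = Flask(__name__)

@app.route("/load", methods=["GET", "POST"])
def load():
    client = storage.Client()
    blob = client.bucket("sale_bucket").blob("seller_data_2021-06-22.csv")
    blob.download_to_filename("/tmp/seller_data.csv")
    # ... process /tmp/seller_data.csv and kick off the BQ load here ...
    return "ok", 200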
The brutal/overkill way: Cloud Run.
A similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run scales down to zero when there's no usage, and deal with other minor things that apply to building any application on Cloud Run.
########################################
For all the above approaches, some additional things you might want to do are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that is where both GAE and CF store temporary files. Cloud Run is a bit different here, but let's not go deep into it, since it is overkill by itself.
However, keep in mind that if your file is large you might cause high memory usage, since /tmp on these runtimes is backed by memory.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open(...) so that you don't keep file handles open.
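A small sketch of point (a), with placeholder names: download to /tmp, process inside a with block, and always remove the file afterwards.

import os

from google.cloud import storage

def process_object(bucket_name, object_name):
    local_path = os.path.join("/tmp", os.path.basename(object_name))
    storage.Client().bucket(bucket_name).blob(object_name).download_to_filename(local_path)
    try:
        with open(local_path) as f:      # the context manager closes the handle for you
            for line in f:
                pass                     # do your processing here
    finally:
        os.remove(local_path)            # keep the memory-backed /tmp from filling up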
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (while always paying close attention to memory usage) is this: upon creating the object I upload to the bucket, I take the current time and use a regex to turn it into a name like results_22_6.
Then, when I list the objects from my other script, they already come back in ascending order, so the last element in the list is the latest object.
So basically what I do is check whether the filename I have in /tmp is the same as the name of the last object in the bucket's listing. If yes, do nothing; if no, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's preferable.
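As a hedged alternative sketch for (b): instead of relying on the name ordering, you can sort by each blob's creation time (the bucket name is just an example):

from google.cloud import storage

client = storage.Client()
blobs = list(client.list_blobs("sale_bucket"))
latest = max(blobs, key=lambda b: b.time_created)
print(latest.name)   # compare against the filename you already keep in /tmp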
I am trying to upload a 400 GB .ibd file from a MySQL DB machine to D42/S3.
I am using the set_contents_from_file function from Python boto, but it is taking a lot of time and I cannot see the progress (how much has been uploaded and how much is left).
Does anyone have a Python script that uses threaded or parallel multipart upload? It's a very simple use case for an end user, but boto's documentation doesn't have any function like this.
In the end I did it with s3cmd and not with Python.
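For reference, if you do want to stay in Python: a hedged boto3 sketch that uses the transfer manager for threaded multipart upload with a progress callback (the endpoint URL, bucket and key are placeholders for the D42/S3 target):

import os
import sys

import boto3
from boto3.s3.transfer import TransferConfig

class Progress(object):
    def __init__(self, path):
        self.total = os.path.getsize(path)
        self.sent = 0
    def __call__(self, bytes_amount):
        # called from the transfer threads with the bytes sent since the last call
        self.sent += bytes_amount
        sys.stdout.write("\r%.1f%% uploaded" % (100.0 * self.sent / self.total))
        sys.stdout.flush()

s3 = boto3.client("s3", endpoint_url="https://my-d42-endpoint.example.com")
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024,
                        max_concurrency=10)
s3.upload_file("mysql.ibd", "my-bucket", "mysql.ibd",
               Config=config, Callback=Progress("mysql.ibd"))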
I'm working on a simple app that takes images, optimizes them, and saves them in cloud storage. I found an example that takes the file and uses PIL to optimize it. The code looks like this:
import StringIO

from PIL import Image
from google.appengine.api import files

def inPlaceOptimizeImage(photo_blob):
    blob_key = photo_blob.key()
    new_blob_key = None
    img = Image.open(photo_blob.open())
    output = StringIO.StringIO()
    img.save(output, img.format, optimize=True, quality=90)
    opt_img = output.getvalue()
    output.close()
    # Create the file
    file_name = files.blobstore.create(mime_type=photo_blob.content_type)
    # Open the file and write to it
    with files.open(file_name, 'a') as f:
        f.write(opt_img)
    # Finalize the file. Do this before attempting to read it.
    files.finalize(file_name)
    # Get the file's blob key
    return files.blobstore.get_blob_key(file_name)
This works fine locally (although I don't know how well it's being optimized, because when I run the uploaded image through something like http://www.jpegmini.com/ it still gets reduced by 2.4x). However, when I deploy the app and try uploading images, I frequently get 500 errors and these messages in the logs:
F 00:30:33.322 Exceeded soft private memory limit of 128 MB with 156 MB after servicing 7 requests total
W 00:30:33.322 While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
I have two questions:
Is this even the best way to optimize and save images in cloud storage?
How do I prevent these 500 errors from occurring?
Thanks in advance.
The error you're experiencing is due to the memory limits of your instance class.
What I would suggest is to edit your .yaml file to configure your module and specify an instance class of F2 or higher.
In case you are not using modules, you should also add "module: default" at the beginning of your app.yaml file to let GAE know that this is your default module.
You can take a look at this article from the docs to see the different instance classes available and how to easily configure them.
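For example, an app.yaml along these lines (a sketch; the runtime and the rest of the file depend on your app) would put the default module on F2 instances:

# app.yaml (example values)
module: default
runtime: python27
api_version: 1
threadsafe: true
instance_class: F2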
Another, more basic workaround would be to limit the image size at upload time, but you will eventually run into a similar issue.
Regarding the previous matter and a way to optimize your images, you may want to take a look at the App Engine Images API, which provides the ability to manipulate image data using a dedicated Images service. In your case, you might like the "I'm Feeling Lucky" transformation. By using this API you might not need to change your instance class.
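A hedged sketch of that route (legacy Python 2.7 runtime assumed; the helper name is made up): the transformation runs inside the Images service rather than in your instance's memory.

from google.appengine.api import images

def optimize_with_images_api(photo_blob):
    img = images.Image(blob_key=str(photo_blob.key()))
    img.im_feeling_lucky()                      # automatic colour/contrast adjustment
    return img.execute_transforms(output_encoding=images.JPEG, quality=90)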
I have an application I am developing on top of GAE, using the Python APIs. I am using the local development server right now. The application involves parsing large blocks of XML data received from an outside service.
So the question is: is there an easy way to export this XML data out of the GAE application? In a regular app I would just write it to a temp file, but in a GAE app I cannot do that. What could I do instead? I cannot easily run the code that produces the service call outside of GAE, since it uses some GAE functions to create the call, but it would be much easier if I could take the XML result out, develop and test the parser outside, and then put it back into the GAE app.
I tried logging it with the logging module and then extracting it from the console, but that doesn't work well once the XML gets big. I know there are bulk data import/export APIs, but they seem like overkill for extracting just this one piece of information: I would have to write it to the datastore and then export the whole store. So what is the best way to do this?
How about writing the XML data to the Blobstore and then writing a handler that uses send_blob to download it to your local file system?
You can use the Files API to write to the Blobstore from your application.
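A rough sketch of both pieces (legacy Python 2.7 runtime with webapp2 assumed; names are made up): dump the XML with the Files API, then expose a download handler that serves the blob.

import webapp2
from google.appengine.api import files
from google.appengine.ext.webapp import blobstore_handlers

def save_xml_to_blobstore(xml_bytes):
    file_name = files.blobstore.create(mime_type="application/xml")
    with files.open(file_name, "a") as f:
        f.write(xml_bytes)
    files.finalize(file_name)
    return files.blobstore.get_blob_key(file_name)

class XmlDownloadHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, blob_key):
        self.send_blob(blob_key, save_as="dump.xml")

app = webapp2.WSGIApplication([(r"/xml/([^/]+)", XmlDownloadHandler)])

Once deployed (or on the dev server), hitting /xml/<blob-key> downloads the XML to your local machine.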