I'm working on a simple app that takes images, optimizes them, and saves them in cloud storage. I found an example that takes the file and uses PIL to optimize it. The code looks like this:
    from google.appengine.api import files
    from PIL import Image
    import StringIO

    def inPlaceOptimizeImage(photo_blob):
        # Read the original blob into PIL
        img = Image.open(photo_blob.open())
        output = StringIO.StringIO()
        # PIL's keyword is "optimize", not "optimized"
        img.save(output, img.format, optimize=True, quality=90)
        opt_img = output.getvalue()
        output.close()
        # Create the file
        file_name = files.blobstore.create(mime_type=photo_blob.content_type)
        # Open the file and write to it
        with files.open(file_name, 'a') as f:
            f.write(opt_img)
        # Finalize the file. Do this before attempting to read it.
        files.finalize(file_name)
        # Get the file's blob key
        return files.blobstore.get_blob_key(file_name)
This works fine locally (although I don't know how well it's being optimized, because when I run the uploaded image through something like http://www.jpegmini.com/ it still gets reduced by 2.4x). However, when I deploy the app and try uploading images, I frequently get 500 errors and these messages in the logs:
F 00:30:33.322 Exceeded soft private memory limit of 128 MB with 156 MB after servicing 7 requests total
W 00:30:33.322 While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
I have two questions:
Is this even the best way to optimize and save images in cloud storage?
How do I prevent these 500 errors from occurring?
Thanks in advance.
The error you're experiencing is due to the memory limits of your instance class.
What I would suggest is editing your .yaml file to configure your module and specify an instance class of F2 or higher.
In case you are not using modules, you should also add "module: default" at the beginning of your app.yaml file to let GAE know that this is your default module.
You can take a look at this article from the docs to see the different instance classes available and how to easily configure them.
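For reference, a minimal sketch of what that app.yaml configuration could look like (the instance class shown here is only an example value; choose whichever class covers your memory needs):

    module: default
    runtime: python27
    api_version: 1
    threadsafe: true
    instance_class: F2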
Another, more basic, workaround would be to limit the image size when uploading, but you will eventually run into a similar issue.
Regarding the previous matter, and as a way to optimize your images, you may want to take a look at the App Engine Images API, which provides the ability to manipulate image data using a dedicated Images service. In your case, you might like the "I'm Feeling Lucky" transformation. By using this API you might not need to upgrade your instance class.
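A minimal sketch of how that could look, assuming you have the photo's blob key (the quality value and JPEG output are just illustrative choices, not something the API requires):

    from google.appengine.api import images

    def optimize_with_images_api(blob_key):
        img = images.Image(blob_key=blob_key)
        # "I'm Feeling Lucky" adjusts color and contrast automatically
        img.im_feeling_lucky()
        # Re-encode as JPEG; the Images service does the work, not your instance
        return img.execute_transforms(output_encoding=images.JPEG, quality=90)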
Background
I finally convinced someone to share his full 5868 GiB archival-node database for free (it now has to be built in RAM, which requires around $100,000 worth of RAM, but it can be run from an SSD once built).
However, he only wants to send it as a single tar file over raw TCP, using a rather slow (400 Mbps) connection for this task.
I need to get it onto Dropbox, and he doesn't want to use the https://www.dropbox.com/request/[my upload key here] link that allows uploading files through a web browser without a Dropbox account (it annoyed him so much when I suggested another method, or compressing the database, that he is on the verge of changing his mind about sharing it).
On my side, Dropbox allows 10 TiB of storage for free for 30 days, and I haven't received the required SSD yet (once it arrives I will be able to download the file at a faster speed).
The problem
I'm fully aware of Upload file to my Dropbox from python script, but in my case the file doesn't fit into a memory buffer, nor even on disk.
And previously, in API v1, it wasn't possible to append data to an existing file (but I didn't find the answer for v2).
To upload a large file to the Dropbox API using the Dropbox Python SDK, you would use upload sessions to upload it in pieces. There's a basic example here.
Note that the Dropbox API only supports files up to 350 GB though.
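As a rough illustration of that upload-session pattern (not the linked example itself), something along these lines should work; the token, paths, and chunk size are placeholders you would replace:

    import os
    import dropbox

    ACCESS_TOKEN = "<your access token>"   # placeholder
    LOCAL_PATH = "/path/to/archive.tar"    # placeholder
    DEST_PATH = "/archive.tar"             # destination path in Dropbox
    CHUNK_SIZE = 64 * 1024 * 1024          # upload 64 MiB per request

    dbx = dropbox.Dropbox(ACCESS_TOKEN)
    file_size = os.path.getsize(LOCAL_PATH)

    with open(LOCAL_PATH, "rb") as f:
        # Start a session with the first chunk
        session = dbx.files_upload_session_start(f.read(CHUNK_SIZE))
        cursor = dropbox.files.UploadSessionCursor(
            session_id=session.session_id, offset=f.tell())
        commit = dropbox.files.CommitInfo(path=DEST_PATH)
        while f.tell() < file_size:
            if (file_size - f.tell()) <= CHUNK_SIZE:
                # Last chunk: finish the session and commit the file
                dbx.files_upload_session_finish(f.read(CHUNK_SIZE), cursor, commit)
            else:
                dbx.files_upload_session_append_v2(f.read(CHUNK_SIZE), cursor)
                cursor.offset = f.tell()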
One CSV file is uploaded to Cloud Storage every day around 0200 hrs, but sometimes, due to a job failure or a system crash, the upload happens very late. So I want to create a Cloud Function that can trigger my Python BQ load script whenever the file is uploaded to storage.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks enough description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches that you might choose from depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
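As a rough sketch of how such a function could look (the dataset/table name and the load settings are placeholders; adapt them to your own BQ load script):

    # main.py of a (1st gen) Python Cloud Function with a Cloud Storage trigger
    from google.cloud import bigquery

    def load_csv_to_bq(event, context):
        """Runs whenever an object is finalized in the bucket."""
        bucket = event["bucket"]
        name = event["name"]

        # Only react to the expected seller_data_{date} files
        if not name.startswith("seller_data_"):
            return

        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
        )
        uri = "gs://{}/{}".format(bucket, name)
        load_job = client.load_table_from_uri(
            uri, "my_dataset.seller_data", job_config=job_config)  # placeholder table
        load_job.result()  # wait for the load job to complete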
The hard way: App Engine with a few tricks.
Have a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically to handle this: check whether the file exists, download the object, manipulate it, and then do something with it.
This route can act as a custom HTTP-triggered function; the request could come from a simple curl call, a visit from the browser, a Pub/Sub push, or even another Cloud Function.
Once it receives a GET (or POST) request, it downloads the object into the /tmp directory, processes it, and then does something with it.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay alive at all times, which means you avoid cold starts and reduce the risk of the request timing out before the job is done.
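A minimal sketch of such an endpoint, using the bucket name from the question (the object name, route, and processing step are placeholders):

    import os
    from flask import Flask
    from google.cloud import storage

    app = Flask(__name__)

    @app.route("/process", methods=["GET", "POST"])
    def process_file():
        client = storage.Client()
        bucket = client.bucket("sale_bucket")
        blob = bucket.blob("seller_data_2022-06-22.csv")  # placeholder object name

        local_path = "/tmp/seller_data.csv"
        blob.download_to_filename(local_path)
        try:
            with open(local_path) as f:
                pass  # process the CSV / kick off the BQ load here
        finally:
            os.remove(local_path)  # always clean up /tmp when done
        return "done", 200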
The brutal/overkill way: Cloud Run.
Similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run scales down to zero when there's no usage, and deal with other minor things that apply to building any application on Cloud Run.
########################################
For all of the above approaches, some additional things you might want to do are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that is the directory both GAE and CF use to store temporary files. Cloud Run is a bit different here, but let's not get deep into it as it's overkill by itself.
However, keep in mind that if your file is large you might run into high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open ..., as it will make sure you don't keep files open.
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (though always paying close attention to memory usage) is this: when creating the object I upload to the bucket, I take the current time and use a regex to turn it into a name like results_22_6.
What happens then is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do is check whether the filename I have in /tmp is the same as the name of the last object in the bucket's list. If yes, do nothing; if no, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's preferable.
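A rough sketch of that check-and-refresh idea follows. Note that instead of relying on the lexicographic order of names like results_22_6, this version sorts by the object's time_created metadata, which is a bit more robust; the bucket name and local paths are placeholders:

    import os
    from google.cloud import storage

    def refresh_latest(bucket_name="sale_bucket", local_path="/tmp/latest.csv"):
        client = storage.Client()
        blobs = list(client.list_blobs(bucket_name))
        if not blobs:
            return None
        # Pick the most recently created object instead of trusting name order
        latest = max(blobs, key=lambda b: b.time_created)

        # If /tmp already holds this object, do nothing; otherwise replace it
        marker = local_path + ".name"
        current = open(marker).read() if os.path.exists(marker) else None
        if current != latest.name:
            latest.download_to_filename(local_path)
            with open(marker, "w") as m:
                m.write(latest.name)
        return latest.name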
I would like to run a program on my laptop (Gazebo simulator) and send a stream of image data to a GCE instance, where it will be run through an object-detection network and sent back to my laptop in near real-time. Is such a set-up possible?
My best idea right now is, for each image:
Save the image as a JPEG on my personal machine
Stream the JPEG to a Cloud Storage bucket
Access the storage bucket from my GCE instance and transfer the file to the instance
In my Python script, convert the JPEG image to a numpy array and run it through the object detection network
Save the detection results in a text file and transfer to the Cloud Storage bucket
Access the storage bucket from my laptop and download the detection results file
Convert the detection results file to a numpy array for further processing
This seems like a lot of steps, and I am curious if there are ways to speed it up, such as reducing the number of save and load operations or transporting the image in a better format.
If your question is "is it possible to set up such a system and do those actions in near real time?" then I think the answer is yes. If your question is "how can I reduce the number of steps in doing the above?" then I am not sure I can help and will defer to one of the experts on here; I can't wait to hear the answer!
I have implemented a system that I think is similar to what you describe, for research into Forex trading algorithms (e.g. upload data to storage from my laptop, Compute Engine workers pull the data and work on it, post results back to storage, and I download the compiled results from my laptop).
I used the Google Pub/Sub architecture - apologies if you have already read up on this. It allows near-real-time messaging between programs. For example, you can have code looping on your laptop that scans a folder and looks out for new images. When they appear, it automatically uploads the files to a bucket, and once they're in the bucket it can send a message to the instance(s) telling them that there are new files there to process; alternatively, you can use the "change notification" feature of Google Storage buckets. The instances can do the work, send the results back to storage, and send a notification to the code running on your laptop that the work is done and results are available for pick-up.
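For instance, the laptop-side notification might be a small publish call like the following (the project and topic names are placeholders):

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "new-images")  # placeholders

    def notify_workers(bucket_name, object_name):
        # Attributes must be strings; the workers subscribe to this topic,
        # read the attributes, and download the object from the bucket.
        future = publisher.publish(
            topic_path,
            b"new image uploaded",
            bucket=bucket_name,
            object=object_name,
        )
        return future.result()  # blocks until the message ID is returned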
Note that I set this up for my project above and encountered problems to the point that I gave up on Pub/Sub. The reason was that the Python client library for Pub/Sub only supports 'asynchronous' message pulls, which seems to mean that the subscribers will pull multiple messages from the queue and process them in parallel. There are some features built into the API to help manage 'flow control' of messages, but even with them implemented I couldn't get it to work the way I wanted. For my particular application I wanted to process everything in order, one file at a time, because it was important to me to be clear about what the instance is doing and the order it's doing it in. There are several threads on Google search, StackOverflow and Google Groups that discuss workarounds for this using queues, classes, allocating specific tasks to specific instances, etc., which I tried, but even these presented problems for me. Some of these links are:
Run synchronous pull in PubSub using Python client API and pubsub problems pulling one message at a time and there are plenty more if you would like them!
You may find that if the processing of an image is relatively quick, order isn't too important, and you don't mind an instance working on multiple things in parallel, then my problems don't really apply to your case.
FYI, I ended up just making a simple loop on my 'worker instances' that scans the 'task list' bucket every 30 seconds or whatever to look for new files to process, but obviously this isn't quite the real-time approach that you were originally looking for. Good luck!
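That fallback polling loop can be as simple as something like this (the bucket name, interval, and the process() helper are hypothetical placeholders):

    import time
    from google.cloud import storage

    def worker_loop(bucket_name="task-list-bucket", interval=30):
        client = storage.Client()
        seen = set()
        while True:
            for blob in client.list_blobs(bucket_name):
                if blob.name not in seen:
                    seen.add(blob.name)
                    process(blob)  # hypothetical handler: download and run detection
            time.sleep(interval)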
I got this error while rendering Google App Engine code.
Does anybody have knowledge about this error?
Are you using appstats? It looks like this can happen when appstats is recording state about your app, especially if you're storing lots of data on the stack. It isn't harmful, but you won't be able to see everything when inspecting calls in appstats.
I would like to download all blobs as a single zipped file (or in another way) to my computer. Is there any way to do that? I use the Python SDK.
No, there's no way to do this. The blobstore can be arbitrarily large, far larger than is practical to download in a single file.
There is a request deadline of 60 seconds for each web request sent to GAE. One request cannot have a response larger than 32 megs, nor can its handler generally use more than 128 megs of memory using the default quotas.
So hypothetically, if you have a very small application, maybe you could assemble a zip in memory of all your blobs. But that's not going to be scalable, and if your blobstore is that small anyway, is it worth it? (No, it isn't.)
Bottom line is, very little in GAE is done all-at-once. You do things iteratively, over multiple requests.
It's probably better to download them one at a time anyway. That way, if your job dies partway through, you can restart from where it failed rather than starting over from scratch.
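For example, a rough sketch of iterating over the blobs so each one can be fetched in its own small request (the idea of a separate download handler is just an assumption about how you serve them):

    from google.appengine.ext import blobstore

    def iter_blob_keys():
        """Yield every blob key; each blob can then be fetched in its own request."""
        for info in blobstore.BlobInfo.all():
            # A separate download handler can serve each blob individually,
            # e.g. with send_blob(info.key()), keeping every response small.
            yield str(info.key())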
How about copying the blobs to Google Cloud Storage and then using gsutil to download them from there?