Is it possible to allow a long instance initialization in GAE? - python

I'm writing software that will automatically process messages from the user's Gmail inbox (a business inbox in our domain). The intention is to deploy this software to Google App Engine, and I can't get my head around the following scenario:
When the software starts, I'd like it to process the messages that accumulated before the automation was started. Since I expect hundreds of messages in the inbox, this phase might take some time. I looked into the following options:
The synchronous way: Start a separate long-running thread to process the pending messages synchronously and use the main thread to put notifications about new incoming messages into a separate queue that will be processed later.
Result: the thread responsible for processing the pending messages is killed as soon as the main thread is through the rest of the code. This is sort of expected now, after reading the docs:
Note that threads will be joined by the runtime when the request ends so the threads cannot run past the end of the request.
The asynchronous way: Use batch calls to get the messages and process the results in a callback. Result: every time the result was expected, the instance had been restarted and the context was lost.
from apiclient.http import BatchHttpRequest

# Batch the Gmail API requests and handle each result in a callback.
batch = self.service.new_batch_http_request(callback=self.cbk_request_messages)
batch.add(self.service.users().messages().get(userId='me', id=id, format='metadata'))
batch.execute()
Or am I using GAE in a completely wrong way here? Thanks in advance.

It's true, you lose the context of the original request, but nothing stops you from extracting the needed info from that context and passing it to subsequent requests so that they can re-create it. Of course, you'll have to persist only the relevant info from the original instance/request context, not the ephemeral context itself.
For example, let's say the original request is the one displaying the homepage for the user after login, when there is a ton of new emails to process. You won't be able to process those messages fast enough to show on that page right away that the emails were processed. But you can collect the mailbox information and enqueue a mailbox-processing task in a push queue, with that mailbox info as a parameter.
The task gets the mailbox info, processes a batch of emails from that mailbox and, if there are still emails left, enqueues another such task, propagating the mailbox info to it. These tasks keep running until there are no more emails to process (see the sketch below).
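A minimal sketch of such a self-chaining push-queue task, assuming a hypothetical process_email_batch() helper and a /tasks/process_mailbox route (the names are illustrative, not part of the original answer):

import webapp2
from google.appengine.api import taskqueue


class ProcessMailboxHandler(webapp2.RequestHandler):
    def post(self):
        mailbox = self.request.get('mailbox')
        # process_email_batch() is assumed to handle a limited number of
        # messages and return True while more remain in the backlog.
        more_pending = process_email_batch(mailbox, batch_size=50)
        if more_pending:
            # Re-enqueue the same task, propagating the mailbox info,
            # until the backlog is drained.
            taskqueue.add(url='/tasks/process_mailbox',
                          params={'mailbox': mailbox})

The original request (e.g. the homepage handler) would enqueue the first task the same way, with taskqueue.add(url='/tasks/process_mailbox', params={'mailbox': mailbox}).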
You can even have the homepage issue repeated AJAX requests in the background to obtain the status of the email processing tasks and maybe display some progress information like X emails yet to be processed, if you want.

Related

How to detect a Celery task doing a similar job before running another task?

My Celery task does time-consuming calculations on some database-stored entity. The workflow is like this: get information from the database, compile it into a serializable object, save the object. Other tasks do other calculations (like rendering images) on the loaded object.
But serialization is time-consuming, so I'd like to have one task per entity running for a while, which holds the serialized object in memory and processes client requests delivered through a messaging queue (Redis pub/sub). If there are no requests for a while, the task exits. After that, if a client needs some job to be done, it runs another task, which loads the object, processes it and stays around for a while for other jobs. This task should check at startup that it is the only worker on this particular entity, to avoid collisions. So what is the best strategy to check whether another task is already running for this entity?
1) The first idea is to send a message to a channel associated with the entity and wait for a response. Bad idea: the target task can be busy with calculations, and waiting for a response with a timeout just wastes time.
2) Storing the Celery task ID in the DB is even worse: the task can be killed while the record stays, so we would also need to verify that the target task is still alive.
3) The third idea is to inspect the workers for running tasks, checking their state for the entity ID (which each task would report at startup). It also seems that collisions can happen here, e.g. if several tasks are scheduled but not running yet.
For now I think idea 1 is the best, with a modification: the task sends a message to the entity channel on startup with its start time, but then immediately starts working instead of waiting for a response. It then checks the message queue, and if someone has responded, the two compare timestamps and the task with the later timestamp quits. This seems complicated enough; is there a better solution?
The final solution is to start a supervisor thread inside the task, which replies to 'discover' messages from competing tasks.
So the workflow is like this:
The task starts, then subscribes to a Redis pub/sub channel keyed by the entity ID
The task sends a 'discover' message to the channel
The task waits a little while
The task looks for a 'reply' among the incoming messages on the channel; if one is found, it exits
The task starts a supervisor thread, which answers every incoming 'discover' message with a 'reply' (a rough sketch of this handshake follows below)
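A rough sketch of that handshake, assuming a plain redis-py client; the claim_entity() name and the wait time are illustrative, not from the original answer:

import threading
import time

from redis import StrictRedis


def claim_entity(entity_id, wait=1.0):
    # Returns True if this task may handle the entity, False if another
    # task already owns it.
    redis = StrictRedis()
    channel = redis.pubsub(ignore_subscribe_messages=True)
    channel.subscribe(entity_id)

    # Ask whether anyone else is already working on this entity.
    redis.publish(entity_id, 'discover')
    deadline = time.time() + wait
    while time.time() < deadline:
        msg = channel.get_message(timeout=0.1)
        if msg and msg['data'] == b'reply':
            return False  # someone answered, so this task should quit

    # Nobody answered: become the owner and answer future 'discover'
    # messages so that later tasks back off.
    def supervise():
        for msg in channel.listen():
            if msg['data'] == b'discover':
                redis.publish(entity_id, 'reply')

    threading.Thread(target=supervise, daemon=True).start()
    return True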
This works fine except when several tasks start simultaneously, e.g. after a worker restart. To avoid this, the subscription process needs to be made atomic using a Redis lock:
from redis import StrictRedis


class RedisChannel:

    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.redis = StrictRedis()
        self.channel = self.redis.pubsub()
        # Take a Redis lock so that competing tasks cannot subscribe to
        # the same channel at the same time.
        with self.redis.lock(channel_id):
            self.channel.subscribe(channel_id)

Batch processing of incoming notifications with GAE

My App Engine app receives notifications from SendGrid about email deliveries, opens, etc. SendGrid doesn't batch these notifications much, so I could receive several per second.
I'd like to batch-process the incoming notifications, for example by processing all of the notifications received in the last minute (my processing involves transactions, so I need to combine notifications to avoid contention). There seem to be several ways of doing this...
For storing the incoming notifications, I could:
add an entity to the datastore or
create a pull queue task.
For triggering processing, I could:
Run a CRON job every minute (is this a good idea?) or
Have the handler that processes the incoming SendGrid requests trigger the processing of notifications, but only if the last trigger was more than a minute ago (a last-trigger timestamp could be stored in memcache).
I'd love to hear pros and cons of the above or other approaches.
After a couple of days, I've come up with an implementation that works pretty well.
For storing incoming notifications, I'm storing the data in pull queue tasks. I didn't know at the time of my question that you can store any raw payload you want in a task; a task doesn't have to be the execution of a function. You could probably store the incoming data in the datastore instead, but then you'd essentially be building your own pull tasks, so you might as well use the pull tasks provided by GAE.
For triggering a worker to process tasks in the pull queue, I came across this excellent blog post about On-demand Cron Jobs by a former GAE developer. I won't repeat the entire post here, but the basic idea is that each time you add a task to the pull queue, you also add a worker task (to a regular push queue) that processes the pull queue. You give the worker task a name derived from the current time interval, which guarantees at most one worker task per interval. This gives you the benefit of a 1-minute cron job with the added bonus that the worker only runs when there is actually something to process.
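A minimal sketch of that pattern, with hypothetical queue names, handler URL and handle_batch() helper; the interval-based task name is what limits it to one worker per minute:

import time

from google.appengine.api import taskqueue

PULL_QUEUE = 'sendgrid-notifications'   # hypothetical queue names
WORKER_QUEUE = 'notification-worker'


def store_notification(payload):
    # Store the raw notification payload in a pull task; no handler URL
    # is needed because the worker leases these tasks explicitly.
    taskqueue.Queue(PULL_QUEUE).add(
        taskqueue.Task(payload=payload, method='PULL'))

    # Enqueue at most one worker per 60-second window by naming the task
    # after the current interval; duplicate names are rejected by GAE.
    interval = int(time.time() / 60)
    try:
        taskqueue.add(queue_name=WORKER_QUEUE,
                      url='/tasks/process_notifications',
                      name='worker-%d' % interval)
    except (taskqueue.TaskAlreadyExistsError,
            taskqueue.TombstonedTaskError):
        pass  # a worker for this interval is already scheduled


def process_notifications():
    # The worker leases a batch of pull tasks, processes them together
    # (e.g. in one transaction) and deletes them on success.
    queue = taskqueue.Queue(PULL_QUEUE)
    tasks = queue.lease_tasks(lease_seconds=60, max_tasks=100)
    if tasks:
        handle_batch([t.payload for t in tasks])  # hypothetical helper
        queue.delete_tasks(tasks)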

Best practice for polling an AWS SQS queue and deleting received messages from queue?

I have an SQS queue that is constantly being populated by a data consumer and I am now trying to create the service that will pull this data from SQS using Python's boto.
The way I designed it is that I will have 10-20 threads all trying to read messages from the SQS queue and then doing what they have to do on the data (business logic), before going back to the queue to get the next batch of data once they're done. If there's no data they will just wait until some data is available.
There are two areas of this design I'm not sure about:
Is it a matter of calling receive_message() with a long timeout value and, if nothing is returned within the 20 seconds (the maximum allowed), simply retrying? Or is there a blocking method that returns only once data is available?
I noticed that once I receive a message, it is not deleted from the queue. Do I have to receive a message and then send another request afterwards to delete it from the queue? That seems like a bit of overkill.
Thanks
The long-polling capability of receive_message() is the most efficient way to poll SQS. If it returns without any messages, I would recommend a short delay before retrying, especially if you have multiple readers. You may even want to use an incremental delay, so that each subsequent empty read waits a bit longer; that way you don't end up getting throttled by AWS.
And yes, you do have to delete the message after you have read it, or it will reappear in the queue. This can actually be very useful when a worker reads a message and then fails before fully processing it: in that case the message is re-queued and read by another worker. You also want to make sure the visibility timeout of the messages is long enough that a worker has time to process a message before it automatically reappears on the queue. If necessary, your workers can extend the timeout while processing if it is taking longer than expected.
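For illustration, a minimal receive/delete loop using boto3 (the question uses the older boto library, but the flow is the same; the region, queue name and handle() function are placeholders):

import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = sqs.get_queue_url(QueueName='my-queue')['QueueUrl']

while True:
    # Long poll for up to 20 seconds; returns early if messages arrive.
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get('Messages', []):
        handle(msg['Body'])  # placeholder for the business logic
        # Delete only after successful processing; otherwise the message
        # becomes visible again once the visibility timeout expires.
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg['ReceiptHandle'])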
If you want a simple way to set up a listener that automatically deletes messages once they have been processed and automatically pushes exceptions to a specified queue, you can use the pySqsListener package.
You can set up a listener like this:
from sqs_listener import SqsListener


class MyListener(SqsListener):
    def handle_message(self, body, attributes, messages_attributes):
        run_my_function(body['param1'], body['param2'])


listener = MyListener('my-message-queue', 'my-error-queue')
listener.listen()
There is a flag to switch from short polling to long polling - it's all documented in the README file.
Disclaimer: I am the author of said package.
Another option is to set up a worker application using AWS Elastic Beanstalk, as described in this blog post.
Instead of long polling with boto3, your Flask application receives the message as a JSON object in an HTTP POST. The HTTP path and the type of messages being sent are configurable in the AWS Elastic Beanstalk Configuration tab.
AWS Elastic Beanstalk has the added benefit of being able to dynamically scale the number of workers as a function of the size of your SQS queue, along with its deployment management benefits.
This is an example application that I found useful as a template.

Can AppEngine python threads last longer than the original request?

We're trying to use the new python 2.7 threading ability in Google App Engine and it seems like the created thread is getting killed before it finishes running. Our scenario:
User sends a message to the server
We update the user's data
We spawn a thread to do some more heavy duty processing
We return a response to the user before waiting for the heavy duty processing to finish
My assumption was that the thread would continue to run after the request had returned, as long as it did not exceed the total request time limit. What we're seeing, though, is that the thread is randomly killed partway through its execution. No exceptions, no errors, nothing. It just stops running.
Are threads allowed to exist after the response has been returned? This does not repro on the dev server, only on live servers.
We could of course use a task queue instead, but that's a real pain since we'd have to set up a url for the action and serialize/deserialize the data.
The 'Sandboxing' section of this page:
http://code.google.com/appengine/docs/python/python27/using27.html#Sandboxing
indicates that threads cannot run past the end of the request.
Deferred tasks are the way to do this. You don't need a URL or serialization to use them:
from google.appengine.ext import deferred
deferred.defer(myfunction, arg1, arg2)
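In the scenario above it could look roughly like this (the handler and helper names are illustrative, not part of the original answer):

import webapp2
from google.appengine.ext import deferred


def heavy_processing(user_id, payload):
    # Runs later on a push queue, outside the original request, so it is
    # not subject to the "threads are joined at the end of the request" rule.
    do_heavy_duty_processing(user_id, payload)  # hypothetical helper


class MessageHandler(webapp2.RequestHandler):
    def post(self):
        user_id = self.request.get('user_id')
        update_user_data(user_id)  # hypothetical helper
        # deferred pickles the function reference and its arguments and
        # enqueues a task; the response is returned immediately.
        deferred.defer(heavy_processing, user_id, self.request.body)
        self.response.write('ok')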

Storing task state between multiple django processes

I am building a logging bridge between RabbitMQ messages and a Django application to store background task state in the database for further investigation/review, and also to make it possible to re-publish tasks via the Django admin interface.
I guess it's nothing fancy, just a standard Producer-Consumer pattern.
Web application publishes to message queue and inserts initial task state into the database
Consumer, which is a separate python process, handles the message and updates the task state depending on task output
The problem is that some tasks are missing from the DB and are therefore never executed.
I suspect it's because the consumer receives the message before the DB commit is performed.
So basically, returning from Model.save() doesn't mean the transaction has been committed, and the whole communication breaks down.
Is there any way I could fix this? Maybe some kind of post_transaction signal I could use?
Thank you in advance.
Web application publishes to message queue and inserts initial task state into the database
Do not do this.
Web application publishes to the queue. Done. Present results via template and finish the web transaction.
A consumer fetches from the queue and does things. For example, it might append a log entry to the database for presentation to the user. The consumer may also post additional status to the database as it executes things.
Indeed, many applications have multiple queues with multiple produce/consumer relationships. Each process might append things to a log.
The presentation must then summarize the log entries. Often, the last one is a sufficient summary, but sometimes you need a count or information from earlier entries.
This sounds brittle to me: You have a web app which posts to a queue and then inserts the initial state into the database. What happens if the consumer processes the message before the web app can commit the initial state?
What happens if the web app tries to insert the new state while the DB is locked by the consumer?
To fix this, the web app should add the initial state to the message and the consumer should be the only one ever writing to the DB.
[EDIT] And you might also have an issue with logging. Check that races between the web app and the consumer produce the appropriate errors in the log by putting a message to the queue without modifying the DB.
[EDIT2] Some ideas:
How about showing just the number of pending tasks? For this, the web app could write into table 1 and the consumer into table 2, and the admin interface would show the difference.
Why can't the web app see the pending tasks which the consumer has in the queue? Maybe you should have two consumers. The first consumer just adds the task to the DB, commits, and then sends a message to the second consumer with only the primary key of the new row. The admin interface could read the table while the second consumer writes to it.
Last idea: commit the transaction before you enqueue the message. For this, you simply have to send a "commit" to the database. It will feel odd (and I certainly don't recommend it as a general practice), but here it might make sense to commit the new row manually, i.e. before you return to your framework, which handles the normal transaction logic (a rough sketch follows below).
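A minimal sketch of that "commit before enqueue" idea, assuming a hypothetical Task model and publish() helper; transaction.on_commit() (available in later Django versions) gives the same ordering without managing the commit by hand:

from django.db import transaction


def create_and_publish(task_data):
    # Task and publish() are hypothetical: Task is the model holding the
    # initial task state, publish() puts the message on RabbitMQ.
    task = Task.objects.create(state='pending', **task_data)
    # Only publish once the row is guaranteed to be visible to the
    # consumer, i.e. after the surrounding transaction has committed.
    transaction.on_commit(lambda: publish(task.pk))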
