Storing task state between multiple Django processes - Python

I am building a logging bridge between RabbitMQ messages and a Django application to store background task state in the database for later investigation/review, and also to make it possible to re-publish tasks via the Django admin interface.
I guess it's nothing fancy, just a standard Producer-Consumer pattern.
The web application publishes to the message queue and inserts the initial task state into the database.
The consumer, which is a separate Python process, handles the message and updates the task state depending on the task output.
The problem is that some tasks are missing from the database and are therefore never executed.
I suspect this is because the consumer receives the message before the database commit has completed.
So basically, returning from Model.save() doesn't mean the transaction has been committed, and the whole exchange breaks down.
Is there any way I could fix this? Maybe some kind of post_transaction signal I could use?
Thank you in advance.

The web application publishes to the message queue and inserts the initial task state into the database.
Do not do this.
Web application publishes to the queue. Done. Present results via template and finish the web transaction.
A consumer fetches from the queue and does things. For example, it might append a log entry to the database for presentation to the user. The consumer may also post additional status to the database as it executes things.
Indeed, many applications have multiple queues with multiple producer/consumer relationships. Each process might append entries to a log.
The presentation must then summarize the log entries. Often, the last one is a sufficient summary, but sometimes you need a count or information from earlier entries.
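To make that split concrete, here is a minimal sketch of the division of labour, assuming pika for RabbitMQ, a hypothetical TaskLog model, and a placeholder run_task() for the business logic; the web view only publishes, and the consumer is the only process that ever writes task state:

import json
import uuid

import pika

def publish_task(payload):
    """Web side: publish to the queue and return immediately; no DB write here."""
    task_id = str(uuid.uuid4())
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='tasks', durable=True)
    channel.basic_publish(exchange='',
                          routing_key='tasks',
                          body=json.dumps({'task_id': task_id, 'payload': payload}))
    connection.close()
    return task_id  # hand this to the template so the user can look the task up later

def handle_message(channel, method, properties, body):
    """Consumer side: the only place task state is ever written to the DB."""
    from myapp.models import TaskLog  # hypothetical model with task_id/status/detail fields
    message = json.loads(body)
    TaskLog.objects.create(task_id=message['task_id'], status='RECEIVED')
    result = run_task(message['payload'])  # placeholder for your business logic
    TaskLog.objects.create(task_id=message['task_id'], status='DONE', detail=result)
    channel.basic_ack(delivery_tag=method.delivery_tag)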

This sounds brittle to me: You have a web app which posts to a queue and then inserts the initial state into the database. What happens if the consumer processes the message before the web app can commit the initial state?
What happens if the web app tries to insert the new state while the DB is locked by the consumer?
To fix this, the web app should add the initial state to the message and the consumer should be the only one ever writing to the DB.
[EDIT] You might also have an issue with logging. Check that races between the web app and the consumer produce the appropriate errors in the log by putting a message on the queue without modifying the DB.
[EDIT2] Some ideas:
How about showing just the number of pending tasks? For this, the web app could write into table 1, the consumer could write into table 2, and the admin UI would show the difference.
Why can't the web app see the pending tasks which the consumer has in the queue? Maybe you should have two consumers. The first consumer just adds the task to the DB, commits, and then sends a message to the second consumer with just the primary key of the new row. The admin interface could read the table while the second consumer writes to it.
Last idea: commit the transaction before you enqueue the message. For this, you simply have to send "commit" to the database. It will feel odd (and I certainly don't recommend it in general), but here it might make sense to commit the new row manually (i.e. before you return to your framework, which handles the normal transaction logic).
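A related option worth knowing about (a sketch, assuming Django 1.9+ where django.db.transaction.on_commit is available): defer the publish until the transaction has committed, which is essentially the "post-transaction signal" the question asks for. Task and publish_to_rabbitmq stand in for the existing model and publisher.

from django.db import transaction

def create_task(request):
    task = Task.objects.create(state='PENDING')   # Task is the existing model
    # The callable runs only after the surrounding transaction commits; on
    # rollback it is discarded, so the consumer can never see a message for
    # a row that does not exist yet.
    transaction.on_commit(lambda: publish_to_rabbitmq(task.pk))
    return task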

Related

Persisting all job results in a separate db in Celery

We are running an API server where users submit jobs for calculation, which take between 1 second and 1 hour. They then make requests to check the status and get their results, which could be (much) later, or even never.
Currently jobs are added to a pub/sub queue, and processed by various worker processes. These workers then send pub/sub messages back to a listener, which stores the status/results in a postgres database.
I am looking into using Celery to simplify things and allow for easier scaling.
Submitting jobs and getting results isn't a problem in Celery, using celery_app.send_task. However, I am not sure how best to ensure the results are stored, particularly for long-running or possibly abandoned jobs.
Some solutions I considered include:
Give all workers access to the database and let them handle updates. The main limitation of this seems to be the DB connection pool limit, as worker processes can scale to 50 replicas in some cases.
Listen to celery events in a separate pod, and write changes based on this to the jobs db. Only 1 connection needed, but as far as I understand, this would miss out on events while this pod is redeploying.
Only check job results when the user asks for them. It seems this could lead to lost results when the user takes too long, or slowly clog the results cache.
As in (3), but periodically check on all jobs not marked completed in the db. A tad complicated, but doable?
Is there a standard pattern for this, or am I trying to do something unusual with Celery? Any advice on how to tackle this is appreciated.
In the past I solved a similar problem by modifying the tasks to not only return the result of the computation but also store it in a cache server (Redis) right before returning. I had a task that periodically (every 5 minutes) collected these results and wrote the data (in bulk, so quite efficient) to a relational database. This worked well until we started filling the cache with hundreds of thousands of results, so we implemented a tiny service that does this instead of the periodically running task.
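A rough sketch of that buffering approach, assuming redis-py, a shared Celery app, and a hypothetical JobResult Django model; the key name and do_work() are placeholders:

import json

import redis
from celery import shared_task

r = redis.Redis()              # the cache server mentioned above
RESULTS_KEY = 'task_results'   # arbitrary key name

@shared_task
def compute(job_id, payload):
    result = do_work(payload)  # placeholder for the actual computation
    # Buffer the result in Redis right before returning.
    r.rpush(RESULTS_KEY, json.dumps({'job_id': job_id, 'result': result}))
    return result

@shared_task
def flush_results():
    """Run every few minutes (e.g. via celery beat): drain the buffer and bulk-insert."""
    rows = []
    while True:
        raw = r.lpop(RESULTS_KEY)
        if raw is None:
            break
        rows.append(json.loads(raw))
    if rows:
        JobResult.objects.bulk_create(
            [JobResult(job_id=x['job_id'], result=x['result']) for x in rows])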

Possibility to allow long initialization of the instance in GAE?

I'm writing software that will automatically process messages from the user's Gmail inbox (a business inbox in our domain). The intention is to deploy this software to Google App Engine, and I can't get my head around the following scenario:
Upon startup I'd like the software to process the messages that accumulated before the automation was started. As I expect hundreds of messages in the inbox, this phase might take some time. I was looking into the following options:
The synchronous way: Start a separate long-running thread to process the pending messages synchronously and use the main thread to put notifications about new incoming messages into a separate queue that will be processed later.
Result: the thread responsible for processing the pending messages is killed as soon as the main thread has run through the rest of the code. This is sort of expected now, after reading the docs:
Note that threads will be joined by the runtime when the request ends so the threads cannot run past the end of the request.
The asynchronous way: Use batch calls to get the messages and process the results in a callback. Result: every time the result was expected, the instance had been restarted and the context was lost.
from apiclient.http import BatchHttpRequest

# Build a batch request; cbk_request_messages is invoked once per response.
batch = self.service.new_batch_http_request(callback=self.cbk_request_messages)
batch.add(self.service.users().messages().get(userId='me', id=id, format='metadata'))
batch.execute()
Or am I using GAE in a completely wrong way here? Thanks in advance.
It's true, you lose the context of the original request, but nothing stops you from extracting the needed info from that context and passing it to subsequent requests so that they can re-create that context. Of course, you'll have to persist only the relevant info from the original instance/request context, not the ephemeral context itself.
For example, let's say that the original request is the one displaying the homepage for the user after login, when there's a ton of new emails that need processing. You won't be able to process those messages fast enough to immediately display in that page that the emails were processed. But you can get the mailbox information and enqueue a mailbox processing task in a push queue with that mailbox info as a parameter.
The task gets the mailbox info, processes a batch of emails from that mailbox and, if there are still emails to be processed, enqueues another such task, propagating the mailbox info to it. These tasks keep running until there are no more emails to process.
You can even have the homepage issue repeated AJAX requests in the background to obtain the status of the email processing tasks and maybe display some progress information like X emails yet to be processed, if you want.
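A minimal sketch of that task-chaining idea for the App Engine Python standard environment with push queues; the handler path, parameter names and process_one_batch() are made up for illustration:

import webapp2
from google.appengine.api import taskqueue

def start_processing(mailbox_id):
    # Called from the original request handler; returns immediately.
    taskqueue.add(url='/tasks/process_mailbox',          # hypothetical handler path
                  params={'mailbox_id': mailbox_id, 'page_token': ''})

class ProcessMailboxHandler(webapp2.RequestHandler):
    def post(self):
        mailbox_id = self.request.get('mailbox_id')
        page_token = self.request.get('page_token')
        # process_one_batch() is a placeholder for the Gmail API batch call;
        # it returns a token if there are more messages left.
        next_token = process_one_batch(mailbox_id, page_token)
        if next_token:
            # More emails to go: chain another task, propagating the context.
            taskqueue.add(url='/tasks/process_mailbox',
                          params={'mailbox_id': mailbox_id, 'page_token': next_token})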

Best practice for polling an AWS SQS queue and deleting received messages from queue?

I have an SQS queue that is constantly being populated by a data consumer and I am now trying to create the service that will pull this data from SQS using Python's boto.
The way I designed it is that I will have 10-20 threads all trying to read messages from the SQS queue and then doing what they have to do on the data (business logic), before going back to the queue to get the next batch of data once they're done. If there's no data they will just wait until some data is available.
There are two areas of this design I'm not sure about:
Is it a matter of calling receive_message() with a long timeout value and, if nothing is returned in the 20 seconds (the maximum allowed), just retrying? Or is there a blocking method that returns only once data is available?
I noticed that once I receive a message, it is not deleted from the queue. Do I have to receive a message and then send another request afterwards to delete it from the queue? That seems like a bit of overkill.
Thanks
The long-polling capability of the receive_message() method is the most efficient way to poll SQS. If that returns without any messages, I would recommend a short delay before retrying, especially if you have multiple readers. You may want to even do an incremental delay so that each subsequent empty read waits a bit longer, just so you don't end up getting throttled by AWS.
And yes, you do have to delete the message after you have read it, or it will reappear in the queue. This can actually be very useful in the case of a worker reading a message and then failing before it can fully process it. In that case, it would be re-queued and read by another worker. You also want to make sure the visibility timeout of the messages is set long enough that the worker has enough time to process the message before it automatically reappears on the queue. If necessary, your workers can adjust the timeout as they are processing if it is taking longer than expected.
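A minimal sketch of that receive/delete loop, shown with boto3 (the question mentions boto, but the pattern is the same); the queue URL and process() are placeholders:

import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'  # placeholder

while True:
    # Long poll: the call blocks for up to 20 seconds waiting for messages.
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get('Messages', []):
        process(msg['Body'])  # placeholder for the business logic
        # Delete only after successful processing; otherwise the message
        # becomes visible again after the visibility timeout and is retried.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])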
If you want a simple way to set up a listener that includes automatic deletion of messages when they're finished being processed, and automatic pushing of exceptions to a specified queue, you can use the pySqsListener package.
You can set up a listener like this:
from sqs_listener import SqsListener

class MyListener(SqsListener):
    def handle_message(self, body, attributes, messages_attributes):
        run_my_function(body['param1'], body['param2'])

listener = MyListener('my-message-queue', 'my-error-queue')
listener.listen()
There is a flag to switch from short polling to long polling - it's all documented in the README file.
Disclaimer: I am the author of said package.
Another option is to set up a worker application using AWS Elastic Beanstalk, as described in this blog post.
Instead of long polling with boto3, your Flask application receives the message as a JSON object in an HTTP POST. The HTTP path and the type of message being sent are configurable in the AWS Elastic Beanstalk Configuration tab.
AWS Elastic Beanstalk has the added benefit of being able to dynamically scale the number of workers as a function of the size of your SQS queue, along with its deployment management benefits.
This is an example application that I found useful as a template.
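For reference, the worker endpoint in that setup can be very small. A sketch, assuming Flask; the path is made up (it is whatever you configure in the worker environment settings) and handle() stands in for the business logic:

from flask import Flask, request

application = Flask(__name__)

@application.route('/worker/process', methods=['POST'])   # hypothetical path
def process_message():
    payload = request.get_json(force=True)    # the SQS message body, posted by the worker daemon
    handle(payload)                            # placeholder for the business logic
    # Returning 200 tells the worker daemon the message was handled,
    # so it deletes the message from the queue for you.
    return '', 200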

Display progress of a long running Python task in Django

I currently have a typical Django structure set up for a project and one web application.
The web application is set up so that a user inputs some information, and this information is taken as the input to run a Python program.
This Python program can take quite a while to finish (grabbing things from the web and doing some text-mining scoring) - sometimes multiple minutes.
On the command line, the program would periodically display where it was in the process (it would first report how many items it had found to score against, then report how far through those items it was in the scoring), which was very useful. However, when I moved this over to a Django setup, I no longer have this capability (at least, not in the same way, since that output now goes to log files).
The way I set it up is that there is an input view and then a results view. The results view takes the input and runs the Python program. It won't display the results until the entire program has run, so on the user side the browser just sits there, sometimes for minutes, before the results are displayed. Obviously, this is not ideal.
Does anyone know of the best way to bring status information on a task to Django?
I've looked into Celery a little bit, but I think since I'm still a beginner in Django I'm confusing myself with some of the documentation. For instance: even if the task is sent off asynchronously to a worker, how does the browser get the current state of the program? Also, consistent documentation seems to be lacking for Celery on Django (I've seen people set up Celery many different ways in their Django projects).
I would appreciate any input here, I've been stuck on this for a while now.
My first suggestion is to psychologically separate Celery from Django when you start to think about the two. They can run in the same environment, but Celery is to asynchronous processing what Django is to HTTP requests.
Also remember that Celery, unlike Django, requires other services to function: at minimum, a message broker. So by using Celery you will increase your architectural requirements.
To address your specific use case, you'll need a system to publish messages from each Celery task to a message broker, and your web client will need to subscribe to those messages.
There's a lot involved here, but the short version is that you can use Redis as your Celery message broker as well as your pub/sub service to get messages back to the browser. You can then use e.g. django-redis-websockets to subscribe the browser to the task state messages in Redis.
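If WebSockets are more than you need, a simpler variant is to have the task record its own progress and let the browser poll a small status view with AJAX. A sketch, assuming a configured Celery app with a result backend (e.g. Redis); the task, view, and helper names are made up:

from celery import shared_task
from celery.result import AsyncResult
from django.http import JsonResponse

@shared_task(bind=True)
def score_documents(self, query):
    docs = fetch_documents(query)     # placeholder for the existing scraping code
    for i, doc in enumerate(docs, start=1):
        score(doc)                    # placeholder for the scoring step
        # update_state stores custom progress metadata in the result backend.
        self.update_state(state='PROGRESS', meta={'current': i, 'total': len(docs)})
    return 'done'

def task_status(request, task_id):
    """Poll this view from the browser every few seconds to update a progress display."""
    result = AsyncResult(task_id)
    info = result.info if isinstance(result.info, dict) else {}
    return JsonResponse({'state': result.state, 'progress': info})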

Threading-type solution with google app engine

I have a "queue" of about a million entities on google app engine. I have to "pop" items off of the queue by using a query.
There are a bunch of client processes running all over the place that are constantly making requests to the stack. My problem is that when one of the clients requests an item, I want to make sure that I am removing that item from the front of the queue, sending it to that client process, and no other processes.
Currently, I am querying for the item, modifying its properties so that a query to the queue no longer includes that item, then saving the item. Using this method, it is very common for one item to be sent to more than one client process at the same time. I suspect this is because there is a delay to when I am making the writes and when they are being reflected to other processes.
Perhaps I need to be using transactions in some way, but when I looked into that, there were a couple of "gotchas". What is a good way to approach this problem?
Is there any reason not to implement the "queue" using App Engine's TaskQueue API? If the size of the queue is the problem, TaskQueue can hold up to 200 million tasks for a paid app, so a million entities would be easily handled.
If you want to be able to simulate queries for a certain task in the queue, you could use task tags and have your client processes pull tasks with a certain tag. Note that pulling tasks is supported through pull queues rather than push queues.
Other than that, if you want to keep your "queue-as-entities" implementation, you could use the Memcache API to signal to client processes which entities need to be processed. Memcache provides stronger consistency when you need to share data between instances of your app, compared to the eventual consistency of the HRD datastore, with the caveat that data in Memcache could be lost at any point in time.
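As an illustration of the Memcache idea, memcache.add() is atomic, so it can act as a lightweight claim on an entity (a sketch; the key prefix and expiry are arbitrary):

from google.appengine.api import memcache

def try_claim(entity_key, lease_seconds=300):
    # add() only succeeds if the key does not exist yet, so at most one
    # client process gets True back for a given entity.
    return memcache.add('claim:%s' % entity_key, True, time=lease_seconds)

Keep the caveat above in mind: Memcache entries can be evicted at any time, so this reduces duplicate deliveries rather than strictly preventing them.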
I see two ways to tackle this:
What you are doing is OK; you just need to use transactions. If your processes take longer than 30s, you can offload them to the task queue, which can be part of a transaction.
You could use pull queues, where you fill up a queue and then client processes pull tasks from it in an atomic fashion (a lease-delete cycle). With pull queues you can be sure that a task is leased only once. Also, a task must be manually deleted from the queue after it's done, meaning that if your process dies, the task will be put back in the queue after the lease expires.
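A sketch of that lease-delete cycle with the App Engine Python taskqueue API; the queue name, batch size and process() are placeholders:

from google.appengine.api import taskqueue

queue = taskqueue.Queue('work-items')   # must be configured as a pull queue in queue.yaml

def pop_items():
    # Lease up to 10 tasks for 60 seconds; no other consumer can lease
    # them while the lease is active.
    tasks = queue.lease_tasks(lease_seconds=60, max_tasks=10)
    for task in tasks:
        process(task.payload)       # placeholder for the business logic
        # Delete only after successful processing; if the process dies first,
        # the task becomes leasable again once the lease expires.
        queue.delete_tasks(task)
    return len(tasks)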
