How to create tasks with a shared lock in Celery? - python

I am new to both Python and Celery, and not quite sure how to frame my problem, so I will try to explain it with an example.
Let's say there is a system where users buy tickets for events. Each event has a fixed cap of tickets (10). Requests to the API are synchronous, and a buy request returns success or failure depending on the event's capacity. Obviously this can create a race condition on that resource.
There are two solutions I could think of:
Somehow add a database constraint that blocks insert requests once the event is full. (I could not find an applicable way to do this, by the way.)
Add a sequential task queue for buy-ticket events. This guarantees there is no race. However, since the caller waits for the API response, under enough load a task might take ages to execute if there is only one queue. To reduce this waiting time, I would have to implement an event-id-based queue mechanism, where each event gets its own sequential queue, created dynamically, with its own consumer. Once the event is full, the consumer should be dissolved. Is there any way to achieve this with Celery? (A sketch of the lock idea I have in mind is below.)
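For illustration, here is a minimal sketch of what I mean by serializing buys per event behind a shared lock; the Redis connection, the capacity counter, and the task itself are assumptions made up for this question, not working code:

import redis
from celery import shared_task

r = redis.Redis()  # assumed Redis instance

EVENT_CAP = 10  # fixed ticket cap per event, as in the example

@shared_task(bind=True, max_retries=None)
def buy_ticket(self, event_id, user_id):
    # One lock per event: buys for different events run in parallel,
    # while buys for the same event are serialized.
    lock = r.lock('event-lock:%s' % event_id, timeout=10)
    if not lock.acquire(blocking=False):
        # Another buy for this event is in flight; retry shortly
        # instead of blocking the worker.
        raise self.retry(countdown=1)
    try:
        sold = int(r.get('event-sold:%s' % event_id) or 0)
        if sold >= EVENT_CAP:
            return 'fail'
        r.set('event-sold:%s' % event_id, sold + 1)
        # ... persist the ticket for user_id here ...
        return 'success'
    finally:
        lock.release()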
Thanks.

Related

Best approach to tackle long polling on the server side

I have a use case where I need to poll an API every second (basically an infinite while loop). Polling is initiated dynamically by a user through an external system, which means there can be multiple polls running at the same time. A poll is complete when the API returns 400. My current implementation looks something like this:
A Flask app deployed on Heroku.
The Flask app has an endpoint which the external system calls to start polling.
That endpoint adds a message to a queue, and as soon as a worker picks it up, it starts polling. I am using the Heroku Redis To Go add-on; under the hood this uses python-rq and Redis.
The problem is that when one polling job runs for a long time, the other jobs just sit in the queue. I want all of the polling to run concurrently.
What's the best approach to tackle this problem? Fire up multiple workers?
What if there could potentially be more than 100 concurrent polls?
You could implement a "weighted"/priority queue. There may be multiple ways of implementing this, but the simplest example that comes to mind is using a min- or max-heap.
You should keep track of how many events are in the queue for each process; as the number of queued events for one process grows, the weight of its newly inserted events should decrease. Every time an event is processed, you start processing the one with the greatest weight next.
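A minimal sketch of that idea using Python's heapq; the weighting rule (priority taken from the owner's queued-event count at insert time) is one possible choice for illustration, not the only one:

import heapq
from collections import defaultdict

class WeightedQueue:
    """Pop the event whose owner had the fewest queued events at insert time."""

    def __init__(self):
        self._heap = []
        self._counts = defaultdict(int)  # queued events per owner
        self._seq = 0                    # tie-breaker keeps pops stable

    def push(self, owner, event):
        self._counts[owner] += 1
        # A lower count sorts first, so a heavy user cannot starve the others.
        heapq.heappush(self._heap, (self._counts[owner], self._seq, owner, event))
        self._seq += 1

    def pop(self):
        _, _, owner, event = heapq.heappop(self._heap)
        self._counts[owner] -= 1
        return owner, event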
PS: More workers will also speed up the work.

Is it a bad practice to use sleep() in a web server in production?

I'm working with Django 1.8 and Python 2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 milliseconds) between each piece of data that I send:
while True:
    send(data)
    sleep(0.01)
So my question is: is it considered bad practice to simply use sleep() to create that pause? Is there maybe a more efficient approach?
UPDATED:
The reason I need to create that pause is that the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesn't return anything after having received, let alone processed, the data. Leaving that brief pause ensures that each chunk of data I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
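A minimal sketch of that pattern; send(), build_chunks(), and the URL wiring are placeholders for your actual code, not a definitive implementation:

# tasks.py
import time
from celery import shared_task

@shared_task
def send_chunks(chunks):
    # The sleep now happens in a Celery worker, not in a web worker,
    # so the request/response cycle finishes immediately.
    for chunk in chunks:
        send(chunk)      # placeholder for the existing socket send
        time.sleep(0.01)

# views.py
from django.http import JsonResponse
from celery.result import AsyncResult

def start_send(request):
    result = send_chunks.delay(build_chunks(request))  # build_chunks is hypothetical
    return JsonResponse({'task_id': result.id})

def send_status(request, task_id):
    # The client refreshes this URL until the task reports completion.
    return JsonResponse({'state': AsyncResult(task_id).state})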
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+ for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at using something like celery/rabbitmq so Django can dump the time delayed part onto something else and then quickly respond with a "task started" message.
If this is in production but you're the only person hitting the webserver, it still isn't great design, but if it works, it will be hard to justify the extra complexity of the task-queue approach mentioned above.

Design for implementing web hooks (including not blocking and ignoring superseding repeat events)

I'm implementing a webhooks provider and trying to solve some problems while minimizing the added complexity to my system:
Not blocking processing of the API call that triggered the event while calling all the hooks so the response to that call will not be delayed
Not making a flood of calls to my listeners if some client is quickly calling my APIs that trigger hooks (i.e., wait a couple of seconds and throw away earlier calls if duplicates come in later)
My environment is Python (Chalice) and AWS Lambda. Ideal solution will be easy to integrate and cheap.
I would use SQS or SNS, depending on the exact architecture design. Maybe Apache Kafka, if you need to store events for longer.
So incoming events would be placed on SQS, and then another Lambda would be used to do the processing. The problem is that processing time is limited to 5 minutes. Also, delivery can't happen in parallel.
Another option is to have one input queue and one output queue per receiver. The Lambda function that processes input just fans it out to the other queues, and other Lambdas are responsible for delivery. That approach has its own obvious problems.
Finally: your Lambda, while processing input, can generate messages on an outgoing queue, recording which message should be delivered to which receivers. You can then have one Lambda triggered per message from the outgoing queue, with a small loop delivering the messages. Note that in case of failures you need to re-enqueue whatever was not delivered.
A good point is that SQS has a dead-letter queue, so problematic messages will not stay in the queue forever.
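A minimal boto3 sketch of that last variant; the queue URL, message shape, and deliver() are assumptions for illustration:

import json
import boto3

sqs = boto3.client('sqs')
OUTGOING_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/webhooks-out'  # hypothetical

def enqueue_deliveries(event_payload, listener_urls):
    # Called by the API-handling Lambda: queue one delivery per listener
    # so the API response is not delayed by the hook calls.
    for url in listener_urls:
        sqs.send_message(
            QueueUrl=OUTGOING_QUEUE_URL,
            MessageBody=json.dumps({'listener': url, 'payload': event_payload}),
        )

def delivery_handler(event, context):
    # Lambda triggered by the outgoing queue (SQS event source mapping).
    for record in event['Records']:
        message = json.loads(record['body'])
        deliver(message['listener'], message['payload'])  # hypothetical HTTP POST to the hook
        # If deliver() raises, the message returns to the queue and lands in
        # the dead-letter queue after maxReceiveCount failed attempts.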

How can Celery distribute users' tasks in a fair way?

The task I'm implementing scrapes some basic info about a URL, such as the title, description, and OGP metadata. If user A requests 200 URLs to scrape, and user B then requests 10 URLs, user B may wait much longer than expected.
What I'm trying to achieve is to rate limit a specific task on a per-user basis or, at least, to be fair between users.
Celery's built-in rate limiting is too broad, since it is keyed on the task name only.
Do you have any suggestion to achieve this kind of fairness?
Related: Celery (Django) Rate limiting
Another way would be to rate limit individual users using a lock. Use the user id as the lock name; if the lock is already held, retry after some task-dependent delay.
Basically, do this:
Ensuring a task is only executed one at a time
Lock on the user id, and retry instead of doing nothing if the lock can't be acquired. Also, it would be better to use Redis instead of the Django cache, but either way will work. A sketch of this is below.
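A minimal sketch of that recipe adapted to a per-user lock; the Redis connection, the timings, and fetch_metadata() are assumptions:

import redis
from celery import shared_task

r = redis.Redis()  # assumed Redis instance

@shared_task(bind=True, max_retries=None)
def scrape_url(self, user_id, url):
    # One lock per user: at most one scrape per user runs at a time,
    # so a single user cannot occupy every worker.
    lock = r.lock('scrape-lock:%s' % user_id, timeout=60)
    if not lock.acquire(blocking=False):
        # Lock held by another of this user's tasks; retry later.
        raise self.retry(countdown=2)
    try:
        return fetch_metadata(url)  # hypothetical scraping function
    finally:
        lock.release()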
One way to work around this could be to ensure that a user does not enqueue more than x tasks, which means counting, for each user, the number of unprocessed tasks enqueued (on the Django side, not with Celery).
How about, instead of running all URL scrapes in a single task, making each scrape its own task and then running them as chains or groups?
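A sketch of the group approach; scrape_one() is a hypothetical per-URL task:

from celery import group, shared_task

@shared_task
def scrape_one(url):
    # Fetch title, description and OGP metadata for a single URL.
    ...

def scrape_all(urls):
    # One short task per URL: the broker interleaves tasks from different
    # users, so a 200-URL batch no longer blocks a 10-URL one.
    return group(scrape_one.s(url) for url in urls).apply_async()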

Threading-type solution with google app engine

I have a "queue" of about a million entities on google app engine. I have to "pop" items off of the queue by using a query.
There are a bunch of client processes running all over the place that are constantly making requests to the stack. My problem is that when one of the clients requests an item, I want to make sure that I remove that item from the front of the queue and send it to that client process, and to no other.
Currently, I query for the item, modify its properties so that a query on the queue no longer includes it, then save the item. With this method, it is very common for one item to be sent to more than one client process at the same time. I suspect this is because there is a delay between when I make the writes and when they become visible to other processes.
Perhaps I need to be using transactions in some way, but when I looked into that, there were a couple of "gotchas". What is a good way to approach this problem?
Is there any reason not to implement the "queue" using App Engine's Task Queue API? If the size of the queue is the problem, the Task Queue can contain up to 200 million tasks for a paid app, so a million entities would be handled easily.
If you want to be able to simulate queries for a certain task in the queue, you could use task tags, and have your client process pull tasks with a certain tag to be processed. Note that pulling tasks is supported through pull queues rather than push queues.
Other than that, if you want to keep your queue-as-entities implementation, you could use the Memcache API to signal to client processes which entity needs to be processed. Memcache provides stronger consistency when you need to share data between instances of your app, compared to the eventual consistency of the HRD datastore, with the caveat that data in Memcache can be lost at any point in time.
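For example, a sketch of claiming an entity atomically through Memcache; the key naming and expiry are assumptions:

from google.appengine.api import memcache

def try_claim(entity_id):
    # memcache.add() is atomic and fails if the key already exists,
    # so only one client process can claim a given entity. Per the
    # caveat above, an eviction would silently release the claim.
    return memcache.add('claim:%s' % entity_id, True, time=300)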
I see two ways to tackle this:
What you are doing is OK, you just need to use transactions. If your processing takes longer than 30 seconds, you can offload it to the task queue, which can be part of a transaction.
You could use pull queues, where you fill up a queue and then client processes pull tasks from it in an atomic fashion (a lease-delete cycle). With pull queues you can be sure that a task is leased only once. Also, a task must be deleted manually from the queue after it's done, meaning that if your process dies, the task is put back in the queue after its lease expires.
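A minimal sketch of the lease-delete cycle with the App Engine taskqueue API; the queue name and process() are assumptions:

from google.appengine.api import taskqueue

queue = taskqueue.Queue('work-queue')  # must be configured as a pull queue in queue.yaml

def pop_and_process():
    # Lease up to 10 tasks for 60 seconds; no other consumer can lease
    # them until the lease expires or they are deleted.
    tasks = queue.lease_tasks(60, 10)
    for task in tasks:
        process(task.payload)  # hypothetical handler
    # Delete only after successful processing; if the process dies first,
    # the tasks become leasable again once the lease expires.
    queue.delete_tasks(tasks)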
