How to avoid race conditions on celery tasks? - python

Considering the following task:
@app.task(ignore_result=True)
def withdraw(user, requested_amount):
    if user.balance >= requested_amount:
        send_money(requested_amount)
        user.balance -= requested_amount
        user.save()
If this task gets executed twice at the same time, it could leave a user with a negative balance... how can I solve it? This is just one example of a race condition, but there are lots of situations like this in my code.

You can use this Celery cookbook recipe to implement a lock that makes sure only one task runs at a time, and then perhaps add retry logic so the second task is tried again later. Something like this:
@task(bind=True)
def import_feed(self, feed_url):
    # The cache key consists of the task name and the MD5 digest
    # of the feed URL.
    feed_url_hexdigest = md5(feed_url).hexdigest()
    lock_id = '{0}-lock-{1}'.format(self.name, feed_url_hexdigest)
    logger.debug('Importing feed: %s', feed_url)
    with memcache_lock(lock_id, self.app.oid) as acquired:
        if acquired:
            return Feed.objects.import_feed(feed_url).url
    self.retry(countdown=2)
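For reference, the memcache_lock context manager used above looks roughly like this. This is a sketch adapted from the Celery documentation's "Ensuring a task is only executed one at a time" recipe and assumes Django's cache framework backed by memcached:

import time
from contextlib import contextmanager
from django.core.cache import cache

LOCK_EXPIRE = 60 * 10  # lock expires after 10 minutes in case the worker dies

@contextmanager
def memcache_lock(lock_id, oid):
    timeout_at = time.monotonic() + LOCK_EXPIRE - 3
    # cache.add is atomic: it only succeeds if the key does not already exist
    status = cache.add(lock_id, oid, LOCK_EXPIRE)
    try:
        yield status
    finally:
        # only release the lock if we acquired it and it has not expired,
        # otherwise we might delete a lock now owned by another worker
        if time.monotonic() < timeout_at and status:
            cache.delete(lock_id)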

Related

How to get in prefect core local task final state?

I have built a flow which implicitly skips running a given task if a kwarg is empty.
I use something like this within task function for skipping logic:
if kwargs.get('processors', Hierarchy()).__len__() == 0:
    raise signals.SKIP('skipping task', result=Prediction())
I want to build some unit tests to make sure that the final state of said task is skipped. What is the easiest way to get state at a task level?
I can see from docs how to get for a flow but not for a task.
Update
To add to Chris's response, I used his first proposed option. As my flow is defined outside of the tests, I created a simple function to get the set of tasks that skipped. In the test this was compared against the set of tasks that should have skipped:
def get_skipped_tasks(flow_state):
    return set(key.name for key, value in flow_state.result.items() if value.is_skipped())
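For example, a test using it might look like this (a sketch only; "predict" stands in for whichever task names are expected to skip in your flow):

def test_expected_tasks_skip():
    flow_state = flow.run()
    # compare against the set of task names that should have skipped
    assert get_skipped_tasks(flow_state) == {"predict"}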
There are a few ways that I'll include here for completeness; for my example I'll use this basic flow:
from prefect import task, Flow
from prefect.engine.signals import SKIP
import random
@task
def random_number():
    return random.randint(0, 100)

@task
def is_even(num):
    if num % 2:
        raise SKIP("odd number")
    return True

with Flow("dummy") as flow:
    even_task = is_even(random_number)
Run the whole flow
When running interactively you can always run the whole flow and access individual task states from the parent flow run state; note that when you "call" a task (e.g., is_even(random_number)) a copy is created, so you need to track these copies correctly.
flow_state = flow.run()
assert flow_state.result[even_task].is_skipped() # for example
Run a piece of the flow with mocked data
When running interactively you can also pass a dictionary of task -> state that the runner will respect; these states can optionally be provided data:
from prefect.engine.state import Success
mocked_state = Success(result=2)
flow_state = flow.run(task_states={random_number: mocked_state})
assert not flow_state.result[even_task].is_skipped()
Use a TaskRunner
Lastly, if you want to run state-based tests on this task alone you can use a TaskRunner. This gets a little more complicated because you have to recreate the upstream dependencies using Edges.
from prefect.engine.task_runner import TaskRunner
from prefect.edge import Edge
runner = TaskRunner(task=even_task)
edge = Edge(key="num", upstream_task=random_number, downstream_task=even_task)
task_state = runner.run(upstream_states={edge: mocked_state})
assert not task_state.is_skipped()

Ensure query is atomic

Background
In my code below I have a function called process that does some stuff, and while it is running I want to make sure it is not run concurrently. I have a table called creation_status where I set a timestamp whenever I start the process. The reason I use a timestamp is that it lets me know what time I started the process in case I need to.
I always check whether there is already a timestamp, and if there is, I raise an exception to make sure I am not running this script concurrently.
Code
def is_in_process() -> bool:
    status = db.run_query(sql="SELECT is_in_process FROM creation_status")
    return False if status[0].is_in_process is None else True

def set_status() -> None:
    db.execute(sql="UPDATE creation_status SET is_in_process = NOW()")

def delete_status() -> None:
    db.execute(sql="UPDATE creation_status SET is_in_process = NULL")

def process():
    if is_in_process():
        raise Exception("Must not run concurrent process creations.")
    set_status()
    # stuff happens
    delete_status()
Issue
I want to make sure my query is atomic to eliminate race conditions. It is possible that between the moment I call is_in_process and the moment I call set_status, another script gets kicked off. How do I ensure both of those things happen in one go so I avoid the race condition?
Please let me know if I can explain anything more clearly; I am open to all suggestions.
Don't use multiple steps when you don't need to.
UPDATE creation_status SET is_in_process = NOW() WHERE is_in_process IS NULL RETURNING is_in_process
Then check to see if a row is returned.
Of course this should probably be done in the same transaction as the rest of the work (which we can't tell from your code), but then the second run will simply block until the previous one is done, rather than aborting.
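Sketched in terms of the question's own helpers, and assuming db.run_query returns the rows produced by the RETURNING clause (an assumption, since only SELECT usage is shown), the single-statement check-and-set could look like this:

def try_set_status() -> bool:
    # One atomic statement: sets the timestamp only if it is not already set,
    # and returns a row only when the update actually happened.
    rows = db.run_query(
        sql="UPDATE creation_status SET is_in_process = NOW() "
            "WHERE is_in_process IS NULL RETURNING is_in_process"
    )
    return len(rows) > 0

def process():
    if not try_set_status():
        raise Exception("Must not run concurrent process creations.")
    try:
        pass  # stuff happens
    finally:
        delete_status()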

How to get and delete record simultaneously in SqlAlchemy?

Several processes read the table at the same time, and each process takes one task. Is it possible to avoid using LOCK TABLE in this case?
db.session.execute('LOCK TABLE "Task"')
query = db.session.query(models.Task).order_by(models.Task.ordr).limit(1)
for row in query:
    task = row
    db.session.delete(row)
db.session.commit()
By locking the table you are using a pessimistic approach to concurrency.
Alternatively, instead of locking the table, you can be optimistic about things going the right way. I would wrap the code that retrieves a task to work on in a retry loop, with error handling for the case where the commit fails because some other process already removed the very task this process tried to get.
Something like this, perhaps:
def get_next_task():
    session = ...
    task = None
    while not task:
        try:
            query = session.query(models.Task).order_by(models.Task.ordr).limit(1)
            for row in query:
                task = row
                session.delete(row)
            session.commit()
            if not task:
                return  # no more tasks found
        except TODO_FIND_PROPER_EXCEPTION_TO_HANDLE as _exc:
            pass  # or log the statement
            # maybe need to make_transient
    return task
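A caller would then simply loop until the function returns None (a trivial usage sketch; process_task is a placeholder for whatever the worker actually does with a task):

while True:
    task = get_next_task()
    if task is None:
        break  # table is empty, nothing left to do
    process_task(task)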
Whether this solution is better will depend on the use case, though.

Django, Celery with Recursion and Twitter API

I'm working with Django 1.4 and Celery 3.0 (rabbitmq) to build an assemblage of tasks for sourcing and caching queries to Twitter API 1.1. One thing I am trying to implement is a chain of tasks, the last of which makes a recursive call to the task two nodes back, based on the responses so far and the data in the most recently retrieved response. Concretely, this allows the app to traverse a user timeline (up to 3200 tweets), taking into account that any given request can yield at most 200 tweets (a limitation of the Twitter API).
Key components of my tasks.py can be seen here, but before pasting them, I'll show the chain I'm calling from my Python shell (it will ultimately be launched via user inputs in the final web app). Given:
>>> request(twitter_user_id='#1010101010101#',
            total_requested=1000,
            max_id=random.getrandbits(128))  # e.g. an arbitrarily large number
I call:
>>> res = (twitter_getter.s(request) |
           pre_get_tweets_for_user_id.s() |
           get_tweets_for_user_id.s() |
           timeline_recursor.s()).apply_async()
The critical thing is that timeline_recursor can initiate a variable number of get_tweets_for_user_id subtasks. When timeline_recursor is in its base case, it should return a response dict as defined here:
@task(rate_limit=None)
def timeline_recursor(request):
    previous_tweets = request.get('previous_tweets', None)  # If it's the first time through, this will be None
    if not previous_tweets:
        previous_tweets = []  # so we initiate to empty array
    tweets = request.get('tweets', None)

    twitter_user_id = request['twitter_user_id']
    previous_max_id = request['previous_max_id']
    total_requested = request['total_requested']
    pulled_in = request['pulled_in']

    remaining_requested = total_requested - pulled_in
    if previous_max_id:
        remaining_requested += 1  # this is because cursored results will always have one overlapping id
    else:
        previous_max_id = random.getrandbits(128)  # for first time through loop

    new_max_id = min([tweet['id'] for tweet in tweets])
    test = lambda x, y: x < y

    if remaining_requested < 0:  # because we overshoot by requesting batches of 200
        remaining_requested = 0

    if tweets:
        previous_tweets.extend(tweets)

    if tweets and remaining_requested and (pulled_in > 1) and test(new_max_id, previous_max_id):
        request = dict(user_pk=user_pk,
                       twitter_user_id=twitter_user_id,
                       max_id=new_max_id,
                       total_requested=remaining_requested,
                       tweets=previous_tweets)
        # problem happens in this part of the logic???
        response = (twitter_getter_config.s(request) | get_tweets_for_user_id.s() | timeline_recursor.s()).apply_async()
    else:
        # if in base case, combine all tweets pulled in thus far and send back as "tweets"
        # -- to be saved in db or otherwise consumed
        response = dict(
            twitter_user_id=twitter_user_id,
            total_requested=total_requested,
            tweets=previous_tweets)
    return response
My expected response for res.result is therefore a dictionary comprised of a twitter user id, a requested number of tweets, and the set of tweets pulled in across successive calls.
However, all is not well in recursive task land. When I run the chain identified above and check res.status right after initiating the chain, it indicates "SUCCESS", even though in the log view of my celery worker I can see that the chained recursive calls to the Twitter API are being made as expected, with the correct parameters. I can also immediately run res.result even as the chained tasks are still executing. res.result yields an AsyncResult instance id. Even after the recursively chained tasks have finished running, res.result remains an AsyncResult id.
On the other hand, I can access my full set of tweets by going to res.result.result.result.result['tweets']. I can deduce that each of the chained subtasks is indeed occurring; I just don't understand why res.result doesn't have the expected result. The recursive returns that should happen when timeline_recursor reaches its base case don't appear to be propagating as expected.
Any thoughts on what can be done? Recursion in Celery can get quite powerful, but to me at least, it's not totally apparent how we should be thinking of recursion and recursive functions that utilize Celery and how this affects the logic of return statements in chained tasks.
Happy to clarify as needed, and thanks in advance for any advice.
What does apply_async return (as in, what type of object)?
I don't know Celery, but in Twisted and many other async frameworks a call like that would return immediately (usually True, or perhaps an object that can track state), because the tasks are deferred into the queue.
Again, not knowing Celery, I would guess that this is happening:
you are: defining response immediately as the deferred async task, and then trying to act on it as if the results have already come in
you want to be: defining a callback routine that runs on the results and returns a value once the task has completed
Looking at the Celery docs, apply_async accepts callbacks via link, and I couldn't find any example of someone trying to capture a return value from it.
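As a sketch of that suggestion (handle_timeline and save_tweets are hypothetical names, not part of the original code), the chain could be given a link callback instead of polling res.result:

@task
def handle_timeline(response):
    # receives the dict returned by the final task in the chain
    save_tweets(response['tweets'])  # placeholder for whatever should consume the result

res = (twitter_getter.s(request) |
       pre_get_tweets_for_user_id.s() |
       get_tweets_for_user_id.s() |
       timeline_recursor.s()).apply_async(link=handle_timeline.s())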

Find out whether celery task exists

Is it possible to find out whether a task with a certain task id exists? When I try to get the status, I will always get pending.
>>> AsyncResult('...').status
'PENDING'
I want to know whether a given task id is a real celery task id and not a random string. I want different results depending on whether there is a valid task for a certain id.
There may have been a valid task in the past with the same id but the results may have been deleted from the backend.
Celery does not write a state when the task is sent; this is partly an optimization (see the documentation).
If you really need it, it's simple to add:
from celery import current_app
# `after_task_publish` is available in celery 3.1+
# for older versions use the deprecated `task_sent` signal
from celery.signals import after_task_publish
# when using celery versions older than 4.0, use body instead of headers
@after_task_publish.connect
def update_sent_state(sender=None, headers=None, **kwargs):
    # the task may not exist if sent using `send_task` which
    # sends tasks by name, so fall back to the default result backend
    # if that is the case.
    task = current_app.tasks.get(sender)
    backend = task.backend if task else current_app.backend
    backend.store_result(headers['id'], None, "SENT")
Then you can test for the PENDING state to detect that a task has not (seemingly)
been sent:
>>> result.state != "PENDING"
AsyncResult.state returns PENDING in case of unknown task ids.
PENDING
Task is waiting for execution or unknown. Any task id that is not
known is implied to be in the pending state.
http://docs.celeryproject.org/en/latest/userguide/tasks.html#pending
You can provide custom task ids if you need to distinguish unknown ids from existing ones:
>>> from tasks import add
>>> from celery.utils import uuid
>>> r = add.apply_async(args=[1, 2], task_id="celery-task-id-"+uuid())
>>> id = r.task_id
>>> id
'celery-task-id-b774c3f9-5280-4ebe-a770-14a6977090cd'
>>> if not "blubb".startswith("celery-task-id-"): print "Unknown task id"
...
Unknown task id
>>> if not id.startswith("celery-task-id-"): print "Unknown task id"
...
Right now I'm using the following scheme (a sketch is shown below):
Get the task id.
Set a memcache key like 'task_%s' % task.id to the message 'Started'.
Pass the task id to the client.
Now from the client I can monitor the task status (set from task messages to memcache).
When the task is ready, it sets the memcache key's message to 'Ready'.
When the client sees the task is ready, it starts a special task that deletes the key from memcache and does the necessary cleanup actions.
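A minimal sketch of that scheme, assuming a Django-style cache backed by memcached and a Celery app object named app (long_job is a hypothetical task name):

from django.core.cache import cache

@app.task(bind=True)
def long_job(self, *args):
    cache.set('task_%s' % self.request.id, 'Started')
    # ... do the actual work ...
    cache.set('task_%s' % self.request.id, 'Ready')

def task_status(task_id):
    # Returns 'Started', 'Ready', or None if the key was never set
    # (i.e. the id is unknown) or has already been cleaned up.
    return cache.get('task_%s' % task_id)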
You need to call .get() on the AsyncResult object you create to actually fetch the result from the backend.
See the Celery FAQ.
To clarify my answer further.
Any string is technically a valid ID; there is no way to validate the task ID itself. The only way to find out if a task exists is to ask the backend if it knows about it, and to do that you must use .get().
This introduces the problem that .get() blocks when the backend doesn't have any information about the task ID you supplied; this is by design, to allow you to start a task and then wait for its completion.
In the case of the original question I'm going to assume that the OP wants to get the state of a previously completed task. To do that you can pass a very small timeout and catch timeout errors:
from celery.exceptions import TimeoutError

try:
    # fetch the result from the backend
    # your backend must be fast enough to return
    # results within 100ms (0.1 seconds)
    result = AsyncResult('blubb').get(timeout=0.1)
except TimeoutError:
    result = None

if result:
    print "Result exists; state=%s" % (result.state,)
else:
    print "Result does not exist"
It should go without saying that this only works if your backend is storing results; if it's not, there's no way to know whether a task ID is valid, because nothing is keeping a record of them.
Even more clarification.
What you want to do cannot be accomplished using the AMQP backend, because it does not store results; it forwards them.
My suggestion would be to switch to a database backend so that the results are in a database that you can query outside of the existing celery modules. If no tasks exist in the result database you can assume the ID is invalid.
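For instance, with the django-celery-results backend (an assumption on my part, not something the question specifies), the check becomes a plain ORM query:

# Hypothetical sketch: requires django-celery-results installed and configured
# as the result backend, so every stored result ends up as a TaskResult row.
from django_celery_results.models import TaskResult

def task_id_is_known(task_id):
    # True only if the backend has ever stored a result for this id
    return TaskResult.objects.filter(task_id=task_id).exists()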
So I have this idea:
import project.celery_tasks as tasks

def task_exist(task_id):
    found = False
    # tasks is my imported task module from celery
    # it is located under /project/project, where the settings.py file is located
    i = tasks.app.control.inspect()

    s = i.scheduled()
    for e in s:
        if task_id in s[e]:
            found = True
            break

    a = i.active()
    if not found:
        for e in a:
            if task_id in a[e]:
                found = True
                break

    r = i.reserved()
    if not found:
        for e in r:
            if task_id in r[e]:
                found = True
                break

    # if checking the status returns pending, yet we found it in any of the queues... it means it exists...
    # if it returns pending, yet we didn't find it in any of the queues... it doesn't exist
    return found
According to https://docs.celeryproject.org/en/stable/userguide/monitoring.html the different types of queue inspections are:
active,
scheduled,
reserved,
revoked,
registered,
stats,
query_task,
so pick and choose as you please.
And there might be a better way to go about checking the queues for their tasks, but this should work for me, for now.
Try
AsyncResult('blubb').state
That may work; it should return something different.
Please correct me if I'm wrong.
if built_in_status_check(task_id) == 'pending':
    if registry_exists(task_id) == true:
        print 'Pending'
    else:
        print 'Task does not exist'
