When adding a certain task to the task queue, I would like to make sure there is only one such task. If this task already exists, I would like to delete it and add the new task instead (postponing its execution is ok also). This is my code:
queue = taskqueue.Queue()
queue.delete_tasks_by_name('task_name')
task = taskqueue.Task(
name = 'task_name',
url = '/task/url',
method = 'GET',
countdown = 3600)
queue.add(task)
When running the code it raises a TombstonedTaskError, which make sense according to the docs. Is there a way to replace or postpone execution of an existing task?
Use tags instead of names. Give the tag a unique name then do a lease_task_by_tag to see if it exists.
add(taskqueue.Task(payload='parse1', method='PULL', tag='parse'))
lease_tasks_by_tag(lease_seconds, max_tasks, tag=None, deadline=10)
Related
I have a simple dag - that takes argument from mysql db - (like sql, subject)
Then I have a function creating report out and send to particular email.
Here is code snippet.
def s_report(k,**kwargs):
body_sql = list2[k][4]
request1 = "({})".format(body_sql)
dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
df1 = dwh_hook.get_pandas_df(request1)
df2 = df1.to_html()
body_Text = list2[k][3]
html_content = f"""HI Team, Please find report<br><br>
{df2} <br> </br>
<b>Thank you!</b><br>
"""
return EmailOperator(task_id="send_email_snowflake{}".format(k), to=list2[k][1],
subject=f"{list2[k][2]}", html_content=html_content, dag=dag)
for j in range(len(list)):
mysql_list >> [ s_report(j)] >> end_operator
The s_report is getting generated dynamically, But the real problem is hook is continously submitting query in backend, While dag is stopped still its submitting query in backend.
I can use pythonoperator, but its not generating dynamic task.
A couple of things:
By looking at your code, in particular the lines:
for j in range(len(list)):
mysql_list >> [ s_report(j)] >> end_operator
we can determine that if your first task succeeds, namely, mysql_list, then the tasks downstream to it, namely, the s_report calls should begin executing. You have precisely len(list) of them. Within each s_report call there is exactly one dwh_hook.get_pandas_df(request) call, so I believe your DAG should be making len(list) calls of this type provided mysql_list task succeeds.
As for the mismatch you see in your Snowflake logs, I can't advise you here. I'd need more details. Keep in mind that the call get_pandas_df might have a retry mechanism (i.e. if cannot reach snowflake, retry) which might explain why your Snowflake logs show a bunch of requests.
If your DAG finishes successfully, (i.e. end_operator tasks finishes successfully), you are correct. There should be no requests in your Snowflake logs that came post-DAG end.
If you want more insight as to how your DAG interacts with your Snowflake resource, I'd suggest having a single s_report task like so:
mysql_list >> [ s_report(0)] >> end_operator
and see the behaviour in the logs.
I wish to send a email with information of the output from my airflow DAG on success. My two approach where first to execute a function from the main DAG (which I could't do as is seems the DAG is unnable to access outputs of what it runs, which I find logical) and the second more promising, to configure it to send the log of the DAG run. I created it following this idea in airflow, but while it sends emails, they're blank. I tried finding the log and creating a list to pass as a message as follows:
def task_success_callback(context):
outer_task_success_callback(context, email='a1u1k6u3f0v1t0r8#justeat.slack.com')
def outer_task_success_callback(context, email):
lines = []
for file in glob.glob("AIRFLOW_HOME/*.log"):
with open(file) as f:
lines = [line for line in f.readlines()]
print(lines)
mensaje = lines
subject = "[Airflow] DAG {0} - Task {1}: Success".format(
context['task_instance_key_str'].split('__')[0],
context['task_instance_key_str'].split('__')[1]
)
html_content = """
DAG: {0}<br>
Task: {1}<br>
Log: {2}<br>
""".format(
context['task_instance_key_str'].split('__')[0],
context['task_instance_key_str'].split('__')[1],
mensaje
)
It didnt seem to produce anything. I did not even return an error. Even weirder, now in the airflow log it doesnt even refer to the event "email send" which it used to do
First: pass messages
To send mail about DAG running info, you need two core airflow components to support you.
They are XCOM and Python operator =>provide_context
XCOM: it is used for exchange information among tasks. The message could be push and pull from Xcom.
This is a subtle but very important point: in general, if two
operators need to share information, like a filename or small amount
of data, you should consider combining them into a single operator. If
it absolutely can’t be avoided, Airflow does have a feature for
operator cross-communication called XCom that is described in the
section XComs
https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom
provide_context (bool)
– if set to true, Airflow will pass a set of keyword arguments that
can be used in your function. This set of kwargs correspond exactly to
what you can use in your jinja templates. For this to work, you need
to define **kwargs in your function header.
https://airflow.apache.org/docs/stable/_api/airflow/operators/python_operator/index.html?highlight=provide_context
Second: run email task after task successed.
There are many way. One is depended on the DAG level. Another is depended on the task level. You can double check about your logic
I'm working on a django application which reads csv file from dropbox, parse data and store it in database. For this purpose I need background task which checks if the file is modified or changed(updated) and then updates database.
I've tried 'Celery' but failed to configure it with django. Then I find django-background-tasks which is quite simpler than celery to configure.
My question here is how to initialize repeating tasks?
It is described in documentation
but I'm unable to find any example which explains how to use repeat, repeat_until or other constants mentioned in documentation.
can anyone explain the following with examples please?
notify_user(user.id, repeat=<number of seconds>, repeat_until=<datetime or None>)
repeat is given in seconds. The following constants are provided:
Task.NEVER (default), Task.HOURLY, Task.DAILY, Task.WEEKLY,
Task.EVERY_2_WEEKS, Task.EVERY_4_WEEKS.
You have to call the particular function (notify_user()) when you really need to execute it.
Suppose you need to execute the task while a request comes to the server, then it would be like this,
#background(schedule=60)
def get_csv(creds):
#read csv from drop box with credentials, "creds"
#then update the DB
def myview(request):
# do something with my view
get_csv(creds, repeat=100)
return SomeHttpResponse
Excecution Procedure
1. Request comes to the url hence it would dispatch to the corresponding view, here myview()
2. Excetes the line get_csv(creds, repeat=100) and then creates a async task in DB (it wont excetute the function now)
3. Returning the HTTP response to the user.
After 60 seconds from the time which the task creation, get_csv(creds) will excecutes repeatedly in every 100 seconds
For example, suppose you have the function from the documentation
#background(schedule=60)
def notify_user(user_id):
# lookup user by id and send them a message
user = User.objects.get(pk=user_id)
user.email_user('Here is a notification', 'You have been notified')
Suppose you want to repeat this task daily until New Years day of 2019 you would do the following
import datetime
new_years_2019 = datetime.datetime(2019, 01, 01)
notify_user(some_id, repeat=task.DAILY, repeat_until=new_years_2019)
I have a Django app saving objects to the database and a celery task that periodically does some processing on some of those objects. The problem is that the user can delete an object after it has been selected by the celery task for processing, but before the celery task has actually finished processing and saving it. So when the celery task does call .save(), the object re-appears in the database even though the user deleted it. Which is really spooky for users, of course.
So here's some code showing the problem:
def my_delete_view(request, pk):
thing = Thing.objects.get(pk=pk)
thing.delete()
return HttpResponseRedirect('yay')
#app.task
def my_periodic_task():
things = get_things_for_processing()
# if the delete happens anywhere between here and the .save(), we're hosed
for thing in things:
process_thing(thing) # could take a LONG time
thing.save()
I thought about trying to fix it by adding an atomic block and a transaction to test if the object actually exists before saving it:
#app.task
def my_periodic_task():
things = Thing.objects.filter(...some criteria...)
for thing in things:
process_thing(thing) # could take a LONG time
try:
with transaction.atomic():
# just see if it still exists:
unused = Thing.objects.select_for_update().get(pk=thing.pk)
# no exception means it exists. go ahead and save the
# processed version that has all of our updates.
thing.save()
except Thing.DoesNotExist:
logger.warning("Processed thing vanished")
Is this the correct pattern to do this sort of thing? I mean, I'll find out if it works within a few days of running it in production, but it would be nice to know if there are any other well-accepted patterns for accomplishing this sort of thing.
What I really want is to be able to update an object if it still exists in the database. I'm ok with the race between user edits and edits from the process_thing, and I can always throw in a refresh_from_db just before the process_thing to minimize the time during which user edits would be lost. But I definitely can't have objects re-appearing after the user has deleted them.
if you open a transaction for the time of processing of celery task, you should avoid such a problems:
#app.task
#transaction.atomic
def my_periodic_task():
things = get_things_for_processing()
# if the delete happens anywhere between here and the .save(), we're hosed
for thing in things:
process_thing(thing) # could take a LONG time
thing.save()
sometimes, you would like to report to the frontend, that you are working on the data, so you can add select_for_update() to your queryset (most probably in get_things_for_processing), then in the code responsible for deletion you need to handle errors when db will report that specific record is locked.
For now, it seems like the pattern of "select again atomically, then save" is sufficient:
#app.task
def my_periodic_task():
things = Thing.objects.filter(...some criteria...)
for thing in things:
process_thing(thing) # could take a LONG time
try:
with transaction.atomic():
# just see if it still exists:
unused = Thing.objects.select_for_update().get(pk=thing.pk)
# no exception means it exists. go ahead and save the
# processed version that has all of our updates.
thing.save()
except Thing.DoesNotExist:
logger.warning("Processed thing vanished")
(this is the same code as in my original question).
Is it possible to find out whether a task with a certain task id exists? When I try to get the status, I will always get pending.
>>> AsyncResult('...').status
'PENDING'
I want to know whether a given task id is a real celery task id and not a random string. I want different results depending on whether there is a valid task for a certain id.
There may have been a valid task in the past with the same id but the results may have been deleted from the backend.
Celery does not write a state when the task is sent, this is partly an optimization (see the documentation).
If you really need it, it's simple to add:
from celery import current_app
# `after_task_publish` is available in celery 3.1+
# for older versions use the deprecated `task_sent` signal
from celery.signals import after_task_publish
# when using celery versions older than 4.0, use body instead of headers
#after_task_publish.connect
def update_sent_state(sender=None, headers=None, **kwargs):
# the task may not exist if sent using `send_task` which
# sends tasks by name, so fall back to the default result backend
# if that is the case.
task = current_app.tasks.get(sender)
backend = task.backend if task else current_app.backend
backend.store_result(headers['id'], None, "SENT")
Then you can test for the PENDING state to detect that a task has not (seemingly)
been sent:
>>> result.state != "PENDING"
AsyncResult.state returns PENDING in case of unknown task ids.
PENDING
Task is waiting for execution or unknown. Any task id that is not
known is implied to be in the pending state.
http://docs.celeryproject.org/en/latest/userguide/tasks.html#pending
You can provide custom task ids if you need to distinguish unknown ids from existing ones:
>>> from tasks import add
>>> from celery.utils import uuid
>>> r = add.apply_async(args=[1, 2], task_id="celery-task-id-"+uuid())
>>> id = r.task_id
>>> id
'celery-task-id-b774c3f9-5280-4ebe-a770-14a6977090cd'
>>> if not "blubb".startswith("celery-task-id-"): print "Unknown task id"
...
Unknown task id
>>> if not id.startswith("celery-task-id-"): print "Unknown task id"
...
Right now I'm using following scheme:
Get task id.
Set to memcache key like 'task_%s' % task.id message 'Started'.
Pass task id to client.
Now from client I can monitor task status(set from task messages to memcache).
From task on ready - set to memcache key message 'Ready'.
From client on task ready - start special task that will delete key from memcache and do necessary cleaning actions.
You need to call .get() on the AsyncTask object you create to actually fetch the result from the backend.
See the Celery FAQ.
To further clarify on my answer.
Any string is technically a valid ID, there is no way to validate the task ID. The only way to find out if a task exists is to ask the backend if it knows about it and to do that you must use .get().
This introduces the problem that .get() blocks when the backend doesn't have any information about the task ID you supplied, this is by design to allow you to start a task and then wait for its completion.
In the case of the original question I'm going to assume that the OP wants to get the state of a previously completed task. To do that you can pass a very small timeout and catch timeout errors:
from celery.exceptions import TimeoutError
try:
# fetch the result from the backend
# your backend must be fast enough to return
# results within 100ms (0.1 seconds)
result = AsyncResult('blubb').get(timeout=0.1)
except TimeoutError:
result = None
if result:
print "Result exists; state=%s" % (result.state,)
else:
print "Result does not exist"
It should go without saying that this only work if your backend is storing results, if it's not there's no way to know if a task ID is valid or not because nothing is keeping a record of them.
Even more clarification.
What you want to do cannot be accomplished using the AMQP backend because it does not store results, it forwards them.
My suggestion would be to switch to a database backend so that the results are in a database that you can query outside of the existing celery modules. If no tasks exist in the result database you can assume the ID is invalid.
So I have this idea:
import project.celery_tasks as tasks
def task_exist(task_id):
found = False
# tasks is my imported task module from celery
# it is located under /project/project, where the settings.py file is located
i = tasks.app.control.inspect()
s = i.scheduled()
for e in s:
if task_id in s[e]:
found = True
break
a = i.active()
if not found:
for e in a:
if task_id in a[e]:
found = True
break
r = i.reserved()
if not found:
for e in r:
if task_id in r[e]:
found = True
break
# if checking the status returns pending, yet we found it in any queues... it means it exists...
# if it returns pending, yet we didn't find it on any of the queues... it doesn't exist
return found
According to https://docs.celeryproject.org/en/stable/userguide/monitoring.html the different types of queue inspections are:
active,
scheduled,
reserved,
revoked,
registered,
stats,
query_task,
so pick and choose as you please.
And there might be a better way to go about checking the queues for their tasks, but this should work for me, for now.
Try
AsyncResult('blubb').state
that may work.
It should return something different.
Please correct me if i'm wrong.
if built_in_status_check(task_id) == 'pending'
if registry_exists(task_id) == true
print 'Pending'
else
print 'Task does not exist'