I have successfully implemented and tested on_success_callback and on_failure_callback in Apache Airflow, including passing parameters to them through the context object. However, I have not been able to implement sla_miss_callback. From various online sources I found that the arguments passed to this function are:
dag, task_list, blocking_task_list, slas, blocking_tis
Unlike the success/failure callbacks, sla_miss_callback does not receive the context object in its argument list. When I try to run operators such as PythonOperator or BashOperator from it, they fail, and the scheduler complains that no context was passed to their execute function.
I looked at other online sources, and in just one (https://www.rea-group.com/blog/watching-the-watcher/) I found that the context object can be extracted from the self object. So I added self to the 5 arguments listed above, but that did not work for me. How can I retrieve or pass the context object to the sla_miss_callback function, not only to run different operators but also to retrieve other details about the DAG that missed its SLA?
It seems it is not possible to pass the context dictionary to the SLA callback (see the source code for sla_miss_callback), but I've found a reasonable workaround to access some other information about the DAG run, such as dag_id, task_id, and execution_date. You can also use any of the built-in macros/parameters, which should work fine. While I use the SlackWebhookOperator for my other callbacks, I use the SlackWebhookHook for the sla_miss_callback. For example:
from airflow.hooks.base import BaseHook
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook

def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis, *args, **kwargs):
    # Each entry in slas describes one missed SLA; report the first one.
    dag_id = slas[0].dag_id
    task_id = slas[0].task_id
    execution_date = slas[0].execution_date.isoformat()
    hook = SlackWebhookHook(
        http_conn_id='slack_callbacks',
        webhook_token=BaseHook.get_connection('slack_callbacks').password,
        message=f"""
            :sos: *SLA has been missed*
            *Task:* {task_id}
            *DAG:* {dag_id}
            *Execution Date:* {execution_date}
            """
    )
    hook.execute()
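The callback above reports only slas[0]. If several SLAs are missed at once, the message can be built from every entry. The following is a minimal sketch of that idea, not part of the original answer: format_sla_message is a hypothetical helper, and the SimpleNamespace records are stand-ins that only mimic the dag_id, task_id, and execution_date attributes used above.

```python
from datetime import datetime
from types import SimpleNamespace


def format_sla_message(slas):
    """Build one alert line per missed SLA instead of reporting only slas[0]."""
    lines = [":sos: *SLA has been missed*"]
    for sla in slas:
        lines.append(
            f"*DAG:* {sla.dag_id} | *Task:* {sla.task_id} | "
            f"*Execution Date:* {sla.execution_date.isoformat()}"
        )
    return "\n".join(lines)


# Stand-in records for illustration only; inside Airflow these would be the
# objects passed to the callback in `slas`.
example_slas = [
    SimpleNamespace(dag_id="etl", task_id="extract", execution_date=datetime(2021, 1, 1)),
    SimpleNamespace(dag_id="etl", task_id="load", execution_date=datetime(2021, 1, 1)),
]
message = format_sla_message(example_slas)
```

The resulting string could then be passed as the message argument of the hook in place of the f-string.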
I have an Apache Beam pipeline that reads data from Google Cloud Datastore. The pipeline runs in Google Cloud Dataflow in batch mode and is written in Python.
The problem is with a templated argument that I'm trying to use to create a Datastore query with a dynamic timestamp filter.
The pipeline is defined as follows:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--filter', type=int)

pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
    user_options = pipeline_options.view_as(UserOptions)
    data = (p
            | 'Read' >> ReadFromDatastore(build_query(user_options.filter.get()))
            | ...
And the build_query function is as follows:
def build_query(filter):
    return Query(
        kind='Kind',
        filters=[('timestamp', '>', filter)],
        project='Project'
    )
Running this leads to the error RuntimeValueProvider(...).get() not called from a runtime context.
I have also tried ReadFromDatastore(build_query(user_options.filter)), but then the error is ValueError: (u"Unknown protobuf attr type [while running 'Read/Read']", <class 'apache_beam.options.value_provider.RuntimeValueProvider'>).
Everything works fine if the templated argument is removed from the equation, e.g. ReadFromDatastore(build_query(1563276063)). So the problem is with using a templated argument while building the Datastore query.
My guess is that build_query should be defined some other way, but after spending some time with the documentation and googling I still have no idea how.
Any suggestions on how I could solve this are highly appreciated!
EDIT 1
Actually, in this case the filter is always relative to the current timestamp, so passing it as an argument is probably not even necessary if there is some other way to use dynamic values. I tried ReadFromDatastore(build_query(int(time())-90000)), but two consecutive runs contained exactly the same filter.
Value providers need to be supported by the source you're using. Only the source can unpack the value at the right moment, that is, at run time.
When creating your own source you obviously have full control over this. When using a pre-existing source, I see only two options:
Provide the value at template creation, meaning don't use a template argument for it.
Create a PR for the pre-existing source so that it supports template arguments.
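The difference between a value fixed at template creation and a value unpacked at run time can be illustrated with a small sketch. Everything in it is a hypothetical stand-in rather than a Beam API: DeferredValue plays the role of a value provider, run_job plays the role of a source that knows how to unpack it, and the clock is faked so that the result is deterministic.

```python
import itertools


class DeferredValue:
    """Hypothetical stand-in for a value provider: stores a callable, not a value."""

    def __init__(self, compute):
        self._compute = compute

    def get(self):
        # Evaluated on every call, i.e. at the moment the job actually runs.
        return self._compute()


clock = itertools.count(start=1000, step=1000)  # fake clock: 1000, 2000, 3000, ...


def now():
    return next(clock)


def run_job(cutoff):
    """Stand-in for a source that unpacks a deferred value when it executes."""
    value = cutoff.get() if isinstance(cutoff, DeferredValue) else cutoff
    return [("timestamp", ">", value)]


# Computed once while the pipeline is being constructed, so it is frozen.
frozen_cutoff = now() - 90
# Only the recipe is stored; the value is computed when run_job asks for it.
deferred_cutoff = DeferredValue(lambda: now() - 90)

frozen_runs = [run_job(frozen_cutoff), run_job(frozen_cutoff)]
deferred_runs = [run_job(deferred_cutoff), run_job(deferred_cutoff)]
```

Both entries in frozen_runs carry the same cutoff, which matches the behaviour described in EDIT 1, while each entry in deferred_runs is computed when the job runs. This only works because run_job knows how to call get(), which is the kind of support the source itself has to provide.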
I am using CherryPy as a web server, and I want to check a user's logged-in status before returning the page. This works on methods in the main Application class (in site.py) but gives an error when I apply the same decorator to a method in a class that is one layer deeper in the webpage tree (in a separate file).
validate_user() is the function used as a decorator. It either passes the user on to the page or sends them to a 401 restricted page. It is registered as a cherrypy.Tool, like this:
from user import validate_user
cherrypy.tools.validate_user = cherrypy.Tool('before_handler', validate_user)
I attach the different sections of the site to the Application class in the main site.py file by assigning instances of the sub-classes as class attributes:
from user import UserAuthentication

class Root:
    user = UserAuthentication()  # maps user/login, user/register, user/logout, etc
    admin = Admin()
    api = Api()

    @cherrypy.expose
    @cherrypy.tools.validate_user()
    def how_to(self, **kw):
        from other_stuff import how_to_page
        return how_to_page(kw)
This, however, does not work when I try to use validate_user() inside the Admin, Api, or Analysis sections, which are in separate files.
import cherrypy

class Analyze:
    @cherrypy.expose
    @cherrypy.tools.validate_user()  #### THIS LINE GIVES ERROR ####
    def explore(self, *args, **kw):  # @addkw(fetch=['uid'])
        import explore
        kw['uid'] = cherrypy.session.get('uid', -1)
        return explore.explorer(args, kw)
The error is that cherrypy.tools doesn't have a validate_user function or method. But other things I assign in site.py do appear in cherrypy here. Why can't I use this tool in a separate file that is part of my overall site map?
In case it is relevant, the validate_user() function simply looks at cherrypy.request.cookie, finds the 'session_token' value, compares it to our database, and passes the request along if the ID matches.
Sorry, I don't know whether the Analyze(), Api(), and User() pages are subclasses, nested classes, extended methods, or something else, so I can't give this a precise title. Do I need to pass the parent class in to them somehow?
The issue here is that Python executes everything except function/method bodies at import time. So in site.py, when you import user (or from user import <anything>), the whole user module is processed, decorators included, before the Python interpreter has reached the definition of the validate_user tool. The decorator tries to access that tool by value (rather than by reference), so the lookup fails.
CherryPy has another mechanism for decorating functions with config that will enable tools on those handlers. Instead of @cherrypy.tools.validate_user, use:
@cherrypy.config(**{"tools.validate_user.on": True})
This decorator works because it does not need to access validate_user from cherrypy.tools to install itself on the handler. Instead, it configures CherryPy to install that tool on the handler later, when the handler is invoked.
If that tool is needed for all methods on that class, you can use that config decorator on the class itself.
Alternatively, you could enable that tool for given endpoints in the server config, as mentioned in the other question.
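The timing difference can be reproduced without CherryPy. The sketch below is only an illustration under stated assumptions: tools is a plain namespace standing in for a tool registry, and dispatch is a hypothetical dispatcher, not CherryPy's real one. It shows that a decorator expression is evaluated as soon as the def statement runs, whereas a recorded config entry is resolved only when the handler is called.

```python
from types import SimpleNamespace

tools = SimpleNamespace()  # stand-in for a tool registry such as cherrypy.tools


def define_handler_eagerly():
    """Decorating by attribute access looks the tool up immediately."""
    @tools.validate_user          # evaluated when this def statement runs
    def handler():
        return "page"
    return handler


def define_handler_lazily():
    """Attaching config only records a name; nothing is looked up yet."""
    def handler():
        return "page"
    handler._config = {"tools.validate_user.on": True}
    return handler


def dispatch(handler):
    """Hypothetical dispatcher: resolve enabled tools when the request arrives."""
    if handler._config.get("tools.validate_user.on"):
        tools.validate_user(handler)
    return handler()


# The tool is not registered yet, so the eager lookup fails at definition time.
try:
    define_handler_eagerly()
    eager_error = None
except AttributeError as exc:
    eager_error = exc

lazy_handler = define_handler_lazily()      # succeeds: no lookup happens here

calls = []


def validate_user(handler):
    calls.append(handler.__name__)


tools.validate_user = validate_user         # the tool is registered later
result = dispatch(lazy_handler)
```

The eager variant fails with an AttributeError because the registry is still empty when the class body is processed, while the config-based variant defers the lookup until the tool exists.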
I am using APScheduler and I need to add jobs with a programmatically created list of trigger options. That is, I can't write code where I pass trigger parameters directly to add_job (such as second="*/5", etc.).
The documentation mentions that you can create a trigger instance and pass that to add_job as the trigger parameter, instead of "cron" or "interval", etc.
I would like to try to do that, as it appears that the trigger constructor takes kwargs style parameters and I should be able to pass it a dictionary.
I have not found an example of how to do this. I have tried:
from apscheduler.triggers import cron

# skipping irrelevant code

class Schedules(object):
    # skipping irrelevant code

    def add_schedule(self, new_schedule):
        # here I create trigger_args as {'second': '*/5'}, for example
        trigger = cron(trigger_args)
This fails with: TypeError: 'module' object is not callable
How do I instantiate a trigger object?
I found a solution to my main problem without figuring out how to create a trigger instance (though I am still curious about how to do that).
The main issue was that I need to create the trigger parameters programmatically. Knowing a bit more now about parameter passing in Python, I see that if I make a dict of all the parameters, not just the trigger parameters, I can pass them this way:
job_def = {}
# here I programmatically create the trigger parameters and add them to the dict
job_def["func"] = job_function
job_def["trigger"] = "cron"
job_def["args"] = [3]
new_job = self.scheduler.add_job(**job_def)
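On the open question of instantiating a trigger: the TypeError arises because cron is a module. The trigger class inside it is CronTrigger (apscheduler.triggers.cron.CronTrigger), so CronTrigger(**trigger_args) is presumably what the documentation has in mind. The sketch below does not use APScheduler at all. It only demonstrates the dict-unpacking technique with a stand-in add_job that mirrors the part of the signature used here (func, trigger, args, and extra keyword arguments collected as trigger options), which is an assumption made for illustration.

```python
def add_job(func, trigger=None, args=None, **trigger_args):
    """Stand-in scheduler method: extra keywords are collected as trigger options."""
    return {"func": func, "trigger": trigger, "args": args, "trigger_args": trigger_args}


def job_function(n):
    return n * 2


# Trigger options built at run time, e.g. from user input or a config file.
trigger_args = {"second": "*/5"}

job_def = {"func": job_function, "trigger": "cron", "args": [3]}
job_def.update(trigger_args)   # merge the trigger options into the call arguments

# Equivalent to add_job(func=job_function, trigger="cron", args=[3], second="*/5")
job = add_job(**job_def)
```

Because the trigger options travel as ordinary keyword arguments, the set of keys in trigger_args can vary from job to job without changing the call site.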
I have several Celery tasks I'm executing within a Django view (more specifically within Django Rest Framework's perform_create method).
What I'm trying to achieve is to immediately (that is, as soon as the task has an id/is in the results backend) access the TaskResult object and do something with it, like this:
tasks = [do_something.s(a) for a in (1, 2, 3, 4,)]
results = group(*tasks).apply_async()
for result in results.children:
    task = TaskResult.objects.get(task_id=result.task_id)
    do_something_with_task_object(task)
Now, this fails with django_celery_results.models.DoesNotExist: TaskResult matching query does not exist.
I have not tried it yet, but I could probably make this work with something like the following snippet. That strikes me as plain wrong and ugly, though, and it also blocks until the tasks are finished:
while not all([TaskResult.objects.filter(task_id=t.task_id).exists() for t in results.children]):
    pass
Is there some way to make this work in a nice and clean fashion?
It turns out that a) the moment you ask a question on StackOverflow, you're able to answer it yourself and b) Django transaction management does everything you need.
If you wrap the call to task.apply_async in an atomic block, all is fine, e.g.
with transaction.atomic():
    results = group(*tasks).apply_async()
TaskResult.objects.get(task_id=results.children[0].task_id)
I don't know if it worked for everyone, but with django-celery-results==2.2.0, the transaction as a context manager doesn't seem to work anymore.
On the other hand, in a post_save signal, it seems ok.
# models.py
@receiver(post_save, sender=TaskResult)
def after_task_result(sender, instance, created, **kwargs):
    if created:
        # on_commit calls its callback with no arguments
        transaction.on_commit(lambda: do_something())
However, with the signal I lose the variables from the view that are not passed along when the model instance is created. In that case, the ugly code is still what works best.
# views.py
while not TaskResult.objects.filter(task_id = task.id).exists(): pass
task = TaskResult.objects.get(task_id = task.id)
# do something more complex with local variables
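If polling is unavoidable, the loop can at least be bounded and made to sleep between checks, so it neither spins at full CPU nor hangs forever when the row never appears. This is a minimal sketch under stated assumptions: wait_until is a hypothetical helper, and row_exists is a stand-in for a check such as TaskResult.objects.filter(task_id=task.id).exists().

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate() until it returns True or the timeout elapses.

    Returns True on success and False on timeout, sleeping between
    attempts so the loop does not busy-wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for the database check: becomes true on the third call.
checks = {"count": 0}


def row_exists():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_until(row_exists, timeout=1.0, interval=0.01)
timed_out = wait_until(lambda: False, timeout=0.05, interval=0.01)
```

In the view, the predicate would wrap the TaskResult query, and a False return value gives the caller a place to handle the case where the result never shows up.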
I'd like to execute a custom callback function once a mapreduce job has finalized/completed.
The only useful references I found for this problem are a somewhat outdated Google site and a related, but again seemingly outdated, Stack Overflow question.
Both those sources assume that I use control.start_map to kick off Mapreduce jobs, and rely on the fact that start_map takes a keyword argument mapreduce_parameters in which one can specify a done_callback argument to specify the url which should be called on completion. However, I'm using a different method (afaik the more recent, preferred one) in which a custom pipeline's run method yields a Mapreduce pipeline:
yield mapreduce_pipeline.MapreducePipeline(
    "word_count",
    "main.word_count_map",
    "main.word_count_reduce",
    "mapreduce.input_readers.BlobstoreZipInputReader",
    "mapreduce.output_writers.BlobstoreOutputWriter",
    mapper_params={
        "blob_key": blobkey,
    },
    reducer_params={
        "mime_type": "text/plain",
    },
    shards=16)
The signature for MapreducePipeline doesn't allow for a mapreduce_parameters argument. The only place in the source where I can see references to a callback is mapper_pipeline.MapperPipeline.run, but it seems to be used only internally.
So, is there a way to get that callback parameter in there?
If not, does someone have good ideas on where and how to extend the library to provide such functionality?
I set up my Mapreduce pipeline paradigm to look a little like the following:
class MRRecalculateSupportsPipeline(base_handler.PipelineBase):
    def run(self, user_key):
        # ...
        yield mapreduce_pipeline.MapreducePipeline(
            'user_recalculate_supports',
            'myapp.mapreduces.user_recalculate_supports_map',
            'myapp.mapreduces.user_recalculate_supports_reduce',
            'mapreduce.input_readers.DatastoreInputReader',
            output_writer_spec=None,
            mapper_params={"""..."""})
If you would like to capture the completion of this pipeline you have two options.
A) Use pipeline.After to run a completion pipeline after the MR pipeline completes.
pipe_future = yield mapreduce_pipeline.MapreducePipeline(
    'user_recalculate_supports',
    'myapp.mapreduces.user_recalculate_supports_map',
    'myapp.mapreduces.user_recalculate_supports_reduce',
    'mapreduce.input_readers.DatastoreInputReader',
    output_writer_spec=None,
    mapper_params={"""..."""})
with pipeline.After(pipe_future):
    # This could be a mapreduce pipeline, or any pipeline using the same
    # base_handler.PipelineBase parent class.
    yield CalcCompletePipeline(...)
B) Use the finalized method of the top-level pipeline to handle completion. Personally, I'd stick with option A, because you can trace the path in the /_ah/*/status?root= view.
class EmailNewReleasePipeline(base_handler.PipelineBase):
    """Email followers about a new release"""
    # TODO: product_key is the name of the parameter, but it's built for albums ...

    def run(self, product_key, testing=False):
        # Send those emails ...
        yield mapreduce_pipeline.MapreducePipeline(...)

    def finalized(self):
        """Save product as launched"""
        ...
        product.launched = True
        product.put()
Here are the docs on the finalization of a pipeline.
A low-effort workaround, at least, is to simply yield another Map/Mapreduce pipeline that does the desired postprocessing.
E.g.:
class MainPipeline(base_handler.PipelineBase):
    def run(self):
        mapper_params = { ... }
        reducer_params = { ... }
        yield mapreduce_pipeline.MapreducePipeline(
            ...,
            mapper_params=mapper_params,
            reducer_params=reducer_params)
        yield PostprocessPipeline(reducer_params)

class PostprocessPipeline(base_handler.PipelineBase):
    def run(self, reducer_params):
        do_some_postprocessing(reducer_params)
That workaround doesn't have access to the Mapreduce state. I suppose the state could somehow be retrieved from the pipeline ID, but it's not yet obvious to me how. So you'll have to set another flag, memcache entry, or datastore entry to check whether the pipeline completed successfully (if that's relevant to the postprocessing).