Mapreduce on GAE Python - Cause ReducePipeline to issue callback on finalize?

I'd like to execute a custom callback function once a mapreduce job has finalized/completed.
The only useful references I found for this problem are a somewhat outdated Google site and a related, but again seemingly outdated Stackoverflow question.
Both those sources assume that I use control.start_map to kick off Mapreduce jobs, and rely on the fact that start_map takes a keyword argument mapreduce_parameters in which one can specify a done_callback argument, i.e. the URL to be called on completion. However, I'm using a different method (afaik the more recent, preferred one) in which a custom pipeline's run method yields a Mapreduce pipeline:
yield mapreduce_pipeline.MapreducePipeline(
    "word_count",
    "main.word_count_map",
    "main.word_count_reduce",
    "mapreduce.input_readers.BlobstoreZipInputReader",
    "mapreduce.output_writers.BlobstoreOutputWriter",
    mapper_params={
        "blob_key": blobkey,
    },
    reducer_params={
        "mime_type": "text/plain",
    },
    shards=16)
The signature for MapreducePipeline doesn't allow for a mapreduce_parameters argument. The only place where I can see a callback reference cropping up in the source is mapper_pipeline.MapperPipeline.run, but it seems to be used internally only.
So, is there a way to get that callback parameter in there?
If not, does someone have good ideas on where and how to extend the library to provide such a functionality?

I set up my Mapreduce pipeline paradigm to look a little like the following:
class MRRecalculateSupportsPipeline(base_handler.PipelineBase):
    def run(self, user_key):
        # ...
        yield mapreduce_pipeline.MapreducePipeline(
            'user_recalculate_supports',
            'myapp.mapreduces.user_recalculate_supports_map',
            'myapp.mapreduces.user_recalculate_supports_reduce',
            'mapreduce.input_readers.DatastoreInputReader',
            output_writer_spec=None,
            mapper_params={...})
If you would like to capture the completion of this pipeline you have two options.
A) Use pipeline.After to run a completion pipeline after the MR pipeline completes.
pipe_future = yield mapreduce_pipeline.MapreducePipeline(
    'user_recalculate_supports',
    'myapp.mapreduces.user_recalculate_supports_map',
    'myapp.mapreduces.user_recalculate_supports_reduce',
    'mapreduce.input_readers.DatastoreInputReader',
    output_writer_spec=None,
    mapper_params={...})

with pipeline.After(pipe_future):
    # This could be a mapreduce pipeline, or any pipeline using the same
    # base_handler.PipelineBase parent class.
    yield CalcCompletePipeline(...)
B) Use the finalized method of the top-level pipeline to handle completion. Personally, I'd stick with option A, because you can trace the execution path in the /_ah/*/status?root= view.
class EmailNewReleasePipeline(base_handler.PipelineBase):
    """Email followers about a new release."""
    # TODO: product_key is the name of the parameter, but it's built for albums ...
    def run(self, product_key, testing=False):
        # Send those emails ...
        yield mapreduce_pipeline.MapreducePipeline(...)

    def finalized(self):
        """Save product as launched."""
        ...
        product.launched = True
        product.put()
Here are the docs on the finalization of a pipeline.

A low-investment workaround for this issue is to simply yield another Map/Mapreduce pipeline that does the desired postprocessing.
E.g.:
class MainPipeline(base_handler.PipelineBase):
    def run(self):
        mapper_params = { ... }
        reducer_params = { ... }
        yield mapreduce_pipeline.MapreducePipeline(
            ...,
            mapper_params=mapper_params,
            reducer_params=reducer_params)
        yield PostprocessPipeline(reducer_params)

class PostprocessPipeline(base_handler.PipelineBase):
    def run(self, reducer_params):
        do_some_postprocessing(reducer_params)
That workaround doesn't have access to the Mapreduce state, which I suppose could somehow be retrieved from the pipeline ID, but it's not yet obvious to me how. So you'll have to set another flag/memcache/datastore entry to check whether the pipeline completed successfully (if that's relevant to the postprocessing).
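As a rough sketch of that flag idea (the memcache key name is made up here, and GAE's memcache API is assumed), the postprocessing pipeline could simply check a completion flag that was set elsewhere, e.g. at the end of the reduce phase:
import logging

from google.appengine.api import memcache

JOB_DONE_KEY = "word_count_done"  # hypothetical flag key set by the reduce phase

class PostprocessPipeline(base_handler.PipelineBase):
    def run(self, reducer_params):
        # Only postprocess if the Mapreduce job reported completion.
        if not memcache.get(JOB_DONE_KEY):
            logging.warning("Mapreduce completion flag not found; skipping postprocessing.")
            return
        do_some_postprocessing(reducer_params)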

Related

How to get context object in sla_miss_callback function

I am able to successfully implement and test on_success_callback and on_failure_callback in Apache Airflow, including passing parameters to them using the context object. However, I am not able to successfully implement sla_miss_callback. Going through different online sources, I found that the arguments that get passed to this function are
dag, task_list, blocking_task_list, slas, blocking_tis
However, unlike the success/failure callbacks, sla_miss_callback doesn't get the context object in its argument list, and if I try to run various operators inside it (such as the Python or Bash operators) they fail and the scheduler complains about the context not being passed to the execute function.
I looked at other online sources, and in just one (https://www.rea-group.com/blog/watching-the-watcher/) I found that we can extract the context object by using the self object. So I appended self to the five arguments described above, but it didn't work for me. I want to know how it is possible to retrieve or pass the context object to the sla_miss_callback function, not only for running different operators but also for retrieving other details about the DAG which has missed the SLA.
It seems it is not possible to pass the context dictionary to the SLA callback (see the source code for sla_miss_callback), but I've found a reasonable workaround to access some other information about the DAG run such as dag_id, task_id, and execution_date. You can also use any of the built-in macros/parameters, which should work fine. While I am using the SlackWebhookOperator for my other callbacks, I am using SlackWebhookHook for the sla_miss_callback. For example:
from airflow.hooks.base import BaseHook
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook

def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis, *args, **kwargs):
    dag_id = slas[0].dag_id
    task_id = slas[0].task_id
    execution_date = slas[0].execution_date.isoformat()
    hook = SlackWebhookHook(
        http_conn_id='slack_callbacks',
        webhook_token=BaseHook.get_connection('slack_callbacks').password,
        message=f"""
        :sos: *SLA has been missed*
        *Task:* {task_id}
        *DAG:* {dag_id}
        *Execution Date:* {execution_date}
        """
    )
    hook.execute()
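For reference, here is a minimal sketch of how such a callback might be wired into a DAG; the DAG id, schedule, and SLA value are illustrative only (sla_miss_callback is set at the DAG level, while the sla itself is set on tasks, e.g. via default_args):
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_sla_dag",                      # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    sla_miss_callback=sla_miss_callback,           # DAG-level callback defined above
    default_args={"sla": timedelta(minutes=30)},   # SLA applied to the tasks
) as dag:
    BashOperator(task_id="slow_task", bash_command="sleep 10")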

How to build a wrapper pytest plugin?

I want to wrap the pytest-html plugin in the following way:
Add an option X
Given the option X, delete data from the report
I was able to add the option by implementing the pytest_addoption(parser) function, but got stuck on the second part...
What I was able to do is implement a hook from pytest-html. However, I have to access my option X in order to know what to do. The problem is that pytest-html's hook does not receive the "request" object as a parameter, so I can't access the option value...
Can I pass additional args to a hook, or do something like that?
You can attach additional data to the report object, for example via a custom wrapper around the pytest_runtest_makereport hook:
@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    report.config = item.config
Now the config object will be accessible via report.config in all reporting hooks, including the ones of pytest-html:
def pytest_html_report_title(report):
    """Called before adding the title to the report."""
    assert report.config is not None
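From there you can read your own option wherever the report object shows up. For example (a sketch only: the --strip-data option and the decision to drop the last cell are made up, and the option is assumed to be registered via pytest_addoption):
# conftest.py
def pytest_addoption(parser):
    # hypothetical "option X"
    parser.addoption("--strip-data", action="store_true", default=False)

def pytest_html_results_table_row(report, cells):
    # report.config was attached in pytest_runtest_makereport above
    if report.config.getoption("--strip-data"):
        cells.pop()  # e.g. remove the last column of this row from the HTML report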

Initializing state on dask-distributed workers

I am trying to do something like
resource = MyResource()

def fn(x):
    something = dosomething(x, resource)
    return something

client = Client()
results = client.map(fn, data)
The issue is that resource is not serializable and is expensive to construct.
Therefore I would like to construct it once on each worker and have it be available for use by fn.
How do I do this?
Or is there some other way to make resource available on all workers?
You can always construct a lazy resource, something like
class GiveAResource():
    resource = [None]

    def get_resource(self):
        if self.resource[0] is None:
            self.resource[0] = MyResource()
        return self.resource[0]
An instance of this will serialise between processes fine, so you can include it as an input to any function to be executed on workers, and then calling .get_resource() on it will get your local expensive resource (which will get remade on any worker which appears later on).
This class would be best defined in a module rather than dynamic code.
There is no locking here, so if several threads ask for the resource at the same time when it has not been needed so far, you will get redundant work.
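A usage sketch under those assumptions (MyResource, dosomething and data are from the question):
from dask.distributed import Client

give = GiveAResource()  # cheap to serialise; holds no real resource yet

def fn(x, giver):
    resource = giver.get_resource()  # constructed once per worker process, then reused
    return dosomething(x, resource)

client = Client()
results = client.map(fn, data, giver=give)  # extra kwargs are forwarded to fn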

python pytest for testing the requests and response

I am a beginner to using pytest in Python and am trying to write test cases for the following method, which gets the user address when a correct id is passed and otherwise raises the custom error BadId.
def get_user_info(id: str, host='127.0.0.1', port=3000) -> str:
    uri = 'http://{}:{}/users/{}'.format(host, port, id)
    result = Requests.get(uri).json()
    address = result.get('user', {}).get('address', None)
    if address:
        return address
    else:
        raise BadId
Can someone help me with this, and can you also suggest the best resources for learning pytest? TIA
Your test regimen might look something like this.
First I suggest creating a fixture to be used in your various method tests. The fixture sets up an instance of your class to be used in your tests rather than creating the instance in the test itself. Keeping tasks separated in this way helps to make your tests both more robust and easier to read.
from my_package import MyClass
import pytest

@pytest.fixture
def a_test_object():
    return MyClass()
You can pass the test object to your series of method tests:
def test_something(a_test_object):
    # do the test
    ...
However if your test object requires some resources during setup (such as a connection, a database, a file, etc etc), you can mock it instead to avoid setting up the resources for the test. See this talk for some helpful info on how to do that.
By the way: if you need to test several different states of the user defined object being created in your fixture, you'll need to parametrize your fixture. This is a bit of a complicated topic, but the documentation explains fixture parametrization very clearly.
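As a tiny sketch of what that can look like (the two states and the load_defaults call are hypothetical):
import pytest

@pytest.fixture(params=["empty", "populated"])
def a_test_object(request):
    obj = MyClass()
    if request.param == "populated":
        obj.load_defaults()  # hypothetical setup for the second state
    return obj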
The other thing you need to do is make sure any .get calls to Requests are intercepted. This is important because it allows your tests to be run without an internet connection, and ensures they do not fail as a result of a bad connection, which is not the thing you are trying to test.
You can intercept Requests.get by using the monkeypatch feature of pytest. All that is required is to include monkeypatch as an input parameter to the test regimen functions.
You can employ another fixture to accomplish this. It might look like this:
import Requests
import pytest

@pytest.fixture
def patched_requests(monkeypatch):
    # store a reference to the old get method
    old_get = Requests.get

    def mocked_get(uri, *args, **kwargs):
        '''A method replacing Requests.get

        Returns either a mocked response object (with json method)
        or the default response object if the uri doesn't match
        one of those that have been supplied.
        '''
        _, id = uri.split('/users/', 1)
        try:
            # attempt to get the correct mocked json method
            json = dict(
                with_address1=lambda: {'user': {'address': 123}},
                with_address2=lambda: {'user': {'address': 456}},
                no_address=lambda: {'user': {}},
                no_user=lambda: {},
            )[id]
        except KeyError:
            # fall back to default behavior
            obj = old_get(uri, *args, **kwargs)
        else:
            # create a mocked requests object
            mock = type('MockedReq', (), {})()
            # assign mocked json to requests.json
            mock.json = json
            # assign obj to mock
            obj = mock
        return obj

    # finally, patch Requests.get with the patched version
    monkeypatch.setattr(Requests, 'get', mocked_get)
This looks complicated until you understand what is happening: we have simply made some mocked json objects (represented by dictionaries) with predetermined user ids and addresses. The patched version of Requests.get simply returns an object of type MockedReq, with the corresponding mocked .json() method, when its id is requested.
Note that Requests will only be patched in tests that actually use the above fixture, e.g.:
def test_something(patched_requests):
    # use patched Requests.get
    ...
Any test that does not use patched_requests as an input parameter will not use the patched version.
Also note that you could monkeypatch Requests within the test itself, but I suggest doing it separately. If you are using other parts of the requests API, you may need to monkeypatch those as well. Keeping all of this stuff separate is often going to be easier to understand than including it within your test.
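For completeness, patching inside a single test would look roughly like this (the canned response is made up):
def test_get_user_info_inline(monkeypatch, a_test_object):
    # A throwaway object whose .json() returns a fixed payload.
    fake_response = type('FakeResp', (), {'json': lambda self: {'user': {'address': 123}}})()
    monkeypatch.setattr(Requests, 'get', lambda uri, *args, **kwargs: fake_response)
    assert a_test_object.get_user_info('with_address1') == 123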
Write your various method tests next. You'll need a different test for each aspect of your method. In other words, you will usually write a different test for the instance in which your method succeeds, and another one for testing when it fails.
First we test method success with a couple test cases.
@pytest.mark.parametrize('id, result', [
    ('with_address1', 123),
    ('with_address2', 456),
])
def test_get_user_info_success(patched_requests, a_test_object, id, result):
    address = a_test_object.get_user_info(id)
    assert address == result
Next we can test for raising the BadId exception using the with pytest.raises feature. Note that since an exception is raised, there is not a result input parameter for the test function.
@pytest.mark.parametrize('id', [
    'no_address',
    'no_user',
])
def test_get_user_info_failure(patched_requests, a_test_object, id):
    from my_package import BadId

    with pytest.raises(BadId):
        address = a_test_object.get_user_info(id)
As posted in my comment, here also are some additional resources to help you learn more about pytest:
link
link
Also be sure to check out Brian Okken's book and Bruno Oliveira's book. They are both very helpful for learning pytest.

How to construct an APScheduler trigger class to pass to add_job?

I am using APScheduler and I need to add jobs with a programmatically created list of trigger options. That is, I can't write code where I pass trigger parameters directly to add_job (such as second="*/5", etc.).
The documentation mentions that you can create a trigger instance and pass that to add_job as the trigger parameter, instead of "cron" or "interval", etc.
I would like to try to do that, as it appears that the trigger constructor takes kwargs style parameters and I should be able to pass it a dictionary.
I have not found an example of how to do this. I have tried:
from apscheduler.triggers import cron

# skipping irrelevant code

class Schedules(object):
    # skipping irrelevant code

    def add_schedule(self, new_schedule):
        # here I create trigger_args as {'second': '*/5'}, for example
        trigger = cron(trigger_args)
This fails with: TypeError: 'module' object is not callable
How do I instantiate a trigger object?
I found a solution to my main problem without figuring out how to create a trigger instance (though I am still curious as to how to do that).
The main issue I had is that I need to programmatically create the trigger parameters. Knowing a bit more now about parameter passing in python, I see that if I make a dict of all the parameters, not just the trigger parameters, I can pass the parameters this way:
job_def = {}
# here I programmatically create the trigger parameters and add them to the dict
job_def["func"] = job_function
job_def["trigger"] = "cron"
job_def["args"] = [3]

new_job = self.scheduler.add_job(**job_def)
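As for the original question of instantiating a trigger directly: the error above comes from calling the cron module rather than the class inside it. The class is CronTrigger, and it does accept the kwargs-style dict (a sketch, assuming APScheduler 3.x):
from apscheduler.triggers.cron import CronTrigger

trigger_args = {'second': '*/5'}
trigger = CronTrigger(**trigger_args)                     # a trigger instance
new_job = self.scheduler.add_job(job_function, trigger)   # pass the instance instead of "cron"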
