I have a celery task:
@task
def foo():
    part1()
    part2()
    part3()
...that I'm breaking up into a chain of subtasks
@task
def foo():
    @task
    def part1():
        ...

    @task
    def part2():
        ...

    @task
    def part3():
        ...

    chain(part1.s(), part2.s(), part3.s()).delay()
The subtasks are inner functions because I don't want them executed outside the context of the parent task. The issue is that my worker does not detect and/or register the inner tasks (I am using autoregistration to discover apps and task modules). If I move them out to the same level in the module as the parent task foo, it works fine.
Does celery support inner functions as tasks? If so, how do I get workers to register them?
The problem with your code is that you get a new definition of part1 every time you call foo(). Also, none of the part functions are created until you call foo, so it is impossible for celery to register any of them when it initializes a worker.
I think the following code is the closest to what you want.
def make_maintask():
    @task
    def subtask1():
        print("do subtask")
    # ... define subtask2 and subtask3 the same way ...

    @task
    def _maintask():
        chain(subtask1.si(), subtask2.si(), subtask3.si()).delay()

    return _maintask

maintask = make_maintask()
This way, the definitions of subtask1 and the other subtasks are not visible from outside.
Some comments:

If all you want to do is hide the subtasks, please think twice. The designers of Python did not believe that access control such as public/private in Java is necessary; it is a feature that severely complicates a language for a dubious benefit. Well-organized packages and modules, plus good naming (such as a leading underscore), can solve the same problems.

If all _maintask does is delegate work to the subtasks, you don't really need to define it as a celery task itself. Don't make a celery task call another celery task unless you really need to.
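For comparison, here is a minimal sketch of that module-level alternative. It assumes a shared_task-style setup; the underscore-prefixed names are my own choice, not from the question:

from celery import chain, shared_task

@shared_task
def _part1():
    pass  # do part 1

@shared_task
def _part2():
    pass  # do part 2

@shared_task
def _part3():
    pass  # do part 3

def foo():
    # a plain function that only builds and dispatches the chain;
    # it does not need to be a celery task itself
    return chain(_part1.si(), _part2.si(), _part3.si()).delay()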
Related
I have a module written roughly as follows:
class __Foo:
    def __init__(self):
        self.__client = CreateClient()  # a library function

    async def call_database(self):
        self.__client.do_thing()

foo = __Foo()
This is designed so that the entire server has only one instance of this client manager: foo is created once and is then used by various modules.

I am now required to use pytest to, well, test everything, but reading online I cannot find the correct way to patch this situation.

The method call_database sends an API request to a different service, which, as part of the test, I do not want to actually happen.

The suggestions I see talk about patching the method on the class itself, but the class is also private, so I couldn't really do that.
The point is, when testing a function in a different module, i.e.:
from foo import foo

async def bar():
    await foo.call_database()
When testing the function bar, I want the call to the database to not happen.
How would I go about doing that?
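One way that is commonly suggested for this kind of setup is to patch the method on the already-created instance rather than on the name-mangled class. A minimal sketch, assuming unittest.mock plus the pytest-asyncio plugin, and with the test-side module names being my own placeholders:

from unittest.mock import AsyncMock, patch

import pytest

import foo                  # the module that creates the singleton instance `foo`
from bar_module import bar  # placeholder name for the module where bar() lives


@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_bar_does_not_call_database():
    # Patch the attribute on the existing instance; the "private" class name
    # never needs to be referenced. Modules that did `from foo import foo`
    # share this exact instance object, so they see the mock too.
    with patch.object(foo.foo, "call_database", new_callable=AsyncMock) as mock_call:
        await bar()
    mock_call.assert_awaited_once()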
In my code, I use multiprocessing.Pool to run some code concurrently. Simplified code looks somewhat like this:
from functools import partial
from multiprocessing import Pool

import requests
from requests import Session

class Wrapper():
    session: Session

    def __init__(self):
        self.session = requests.Session()
        # Session initialization

    def upload_documents(self, documents):
        with Pool(4) as pool:
            upload_file = partial(self.upload_document)
            pool.starmap(upload_file, documents)
        summary = create_summary(documents)
        self.upload_document(summary)

    def upload_document(self, doc):
        self.post(doc)

    def post(self, data):
        self.session.post(self.url, data, other_params)
So basically sending documents via HTTP is parallelized. Now I want to test this code, and can't do it. This is my test:
@patch.object(Session, 'post')
def test_study_upload(self, post_mock):
    response_mock = Mock()
    post_mock.return_value = response_mock
    response_mock.ok = True

    with Wrapper() as wrapper:
        wrapper.upload_documents(documents)

    mc = post_mock.mock_calls
And in debug I can check the mock calls. There is one that looks valid, and it's the one uploading the summary, and a bunch of calls like call.json(), call.__len__(), call.__str__() etc.
There are no calls uploading documents. When I set breakpoint in upload_document method, I can see it is called once for each document, it works as expected. However, I can't test it, because I can't verify this behavior by mock. I assume it's because there are many processes calling on the same mock, but still - how can I solve this?
I use Python 3.6
The approach I would take here is to keep your test as granular as possible and mock out other calls. In this case you'd want to mock your Pool object and verify that it's calling what you're expecting, not actually rely on it to spin up child processes during your test. Here's what I'm thinking:
@patch('yourmodule.Pool')
def test_study_upload(self, mock_pool_init):
    mock_pool_instance = mock_pool_init.return_value.__enter__.return_value

    with Wrapper() as wrapper:
        wrapper.upload_documents(documents)

    # To get the upload_file arg here, you'll need to either mock the partial call,
    # or actually call it and get the return value
    mock_pool_instance.starmap.assert_called_once_with(upload_file, documents)
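If reproducing the exact partial object for that assertion proves awkward, one alternative (a sketch using standard Mock attributes) is to pull the recorded call apart via call_args instead:

# inspect the single starmap call without rebuilding the partial object
(called_func, called_docs), _ = mock_pool_instance.starmap.call_args
assert called_docs == documents
assert called_func.func == wrapper.upload_document  # the partial wraps the bound method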
Then you'd want to take your existing logic and test your upload_document function separately:
@patch.object(Session, 'post')
def test_upload_file(self, post_mock):
    response_mock = Mock()
    post_mock.return_value = response_mock
    response_mock.ok = True

    with Wrapper() as wrapper:
        wrapper.upload_document(document)

    mc = post_mock.mock_calls
This gives you coverage both on the function that creates and controls your pool, and on the function that the pool instance calls. Caveat: I didn't test this, and I'm leaving some of it for you to fill in, since it looks like an abbreviated version of the actual module in your original question.
EDIT:
Try this:
def test_study_upload(self):
    def call_direct(func_var, documents):
        return func_var(documents)

    with patch('yourmodule.Pool.starmap', new=call_direct):
        with Wrapper() as wrapper:
            wrapper.upload_documents(documents)
This is patching out the starmap call so that it calls the function you pass in directly. It circumvents the Pool entirely; the bottom line being that you can't really dive into those subprocesses created by multiprocessing.
I was wondering whether there is a standardized approach or best practice for scanning/autodiscovering decorators, as is done in the example below and in several other libraries such as Django and Flask. Usually a decorator provides extra/wrapped functionality right at the time the inner function is called.

In the example shown below, but also in Flask/Django (route decorators), the decorator is instead used to add overarching functionality, e.g. spawning a TCP client as part of the decorator logic and then calling the inner function only when a message is received, to process it.

Flask/Django register a URL route, and the inner function is only called later, when that URL is requested. All of these examples require an initial registration (scan/discovery) of the decorator logic so that the overarching functionality is started in the first place. To me this seems to be an alternative use of decorators, and I would like to understand the best-practice approach, if there is one.

See the Faust example below, where the decorator app.agent() automatically starts a listening (Kafka stream) client within the asyncio event loop, and an incoming message is then processed by the inner function hello() only when a message is received. This requires an initial check/scan/discovery of the related decorator logic at the start of the script.
import faust

class Greeting(faust.Record):
    from_name: str
    to_name: str

app = faust.App('hello-app', broker='kafka://localhost')
topic = app.topic('hello-topic', value_type=Greeting)

@app.agent(topic)
async def hello(greetings):
    async for greeting in greetings:
        print(f'Hello from {greeting.from_name} to {greeting.to_name}')

@app.timer(interval=1.0)
async def example_sender(app):
    await hello.send(
        value=Greeting(from_name='Faust', to_name='you'),
    )

if __name__ == '__main__':
    app.main()
Nothing is "discovered". When you import a module from a package, all of that code is executed. This is why we have if __name__ == '__main__' to stop certain code from being executed on import. The decorators will be "discovered" when you run your code.
I think the Flask blueprint is a nice example. Here you can see how it registers the url endpoints when you import modules. All it's doing is appending to a list:
def route(self, rule, **options):
    """Like :meth:`Flask.route` but for a blueprint. The endpoint for the
    :func:`url_for` function is prefixed with the name of the blueprint.
    """

    def decorator(f):
        endpoint = options.pop("endpoint", f.__name__)
        self.add_url_rule(rule, endpoint, f, **options)
        return f

    return decorator
The code runs, the decorators are evaluated and they need only keep some internal list of all the functions they decorate. These are stored in the Blueprint object.
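To make the pattern concrete outside of Flask, here is a stripped-down sketch of such a registering decorator (all names here are illustrative, not from any particular library):

# a tiny registry: the decorator only records the function at import time
_registered = []

def register(name):
    def decorator(func):
        _registered.append((name, func))
        return func  # the function itself is returned unchanged
    return decorator

@register("hello")
def hello():
    return "hi"

# later, some runner (a web server, a Kafka consumer loop, ...) walks the
# registry and decides when each recorded function actually gets called
for name, func in _registered:
    print(name, func())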
I'm trying to clean up some things after I kill a running task within celery. I'm currently hitting two problems:

1) Inside the task_revoked handler body, how can I get access to the parameters that the task function was called with? For example, if the task is defined as:
@app.task()
def foo(bar, baz):
    pass
How will I get access to bar and baz inside the task_revoked.connect code?

2) I want to kill a task only when its state is anything but X. That means inspecting the task on one hand, and setting the state on the other. Inspecting the state could probably be done, but I'm having difficulty getting my head around the context inside the task function body.
If I define foo like this:
@app.task(bound=True)
def foo(self, bar, baz):
    pass
and call it from, say, Flask like foo(bar, baz), then I'll get an error that a third parameter is expected, which means the decorator does not automatically add any context through the self parameter.

The app is simply defined as celery.Celery().
Thanks in advance
You can get the task's args from the request object.
from celery.signals import task_revoked

@task_revoked.connect
def my_task_revoked_handler(sender=None, body=None, *args, **kwargs):
    print(kwargs['request'].args)
This prints the arguments given to the task.
Update:
You have to use bind, not bound:
@app.task(bind=True)
def foo(self, bar, baz):
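With bind=True in place, self inside the task body exposes the execution context, which also touches the second part of the question about inspecting and setting state. A small sketch (the PROGRESS state and meta payload are illustrative, not from the question):

@app.task(bind=True)
def foo(self, bar, baz):
    # self.request carries the execution context: id, args, kwargs, retries, ...
    print(self.request.id, self.request.args, self.request.kwargs)
    # the task can also report a custom state from inside its own body
    self.update_state(state='PROGRESS', meta={'step': 1})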
Does celery purge/fail to copy instance variables when a task is handled by delay?
class MyContext(object):
    a = 1

class MyTask(Task):
    def run(self):
        print self.context.a

from tasks import MyTask, MyContext

c = MyContext()
t = MyTask()
t.context = c
print t.context.a
# Shows 1

t.delay()
=====Worker Output
Task tasks.MyTask[d30e1c37-d094-4809-9f72-89ff37b81a85]
raised exception: AttributeError("'NoneType' object has no attribute 'a'",)
It looks like this has been asked before here, but I do not see an answer.
This doesn't work because the instance that actually runs isn't the same instance on which you call the delay method. Every worker instantiates its own singleton for each task.

In short, celery isn't designed for task objects to carry data. Data should be passed to the task through the delay or apply_async methods. If the context object is simple and can be pickled, just pass it to delay. If it's complex, a better approach may be to pass a database id so that the task can retrieve it in the worker.
http://docs.celeryproject.org/en/latest/userguide/tasks.html#instantiation
Also, note that in celery 2.5 delay and apply_async were class methods.
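A minimal sketch of that recommendation, using the current decorator-based API rather than the class-based style in the question (the broker URL and payload are illustrative):

from celery import Celery

app = Celery('tasks', broker='redis://localhost')  # broker URL is illustrative

@app.task
def my_task(context):
    # the data arrives as a deserialized task argument, not as task state
    print(context['a'])

# caller side: pass the data itself (or a database id) to delay
my_task.delay({'a': 1})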