Attributes of Celery Tasks

Does celery purge/fail to copy instance variables when a task is handled by delay?
# tasks.py
from celery.task import Task

class MyContext(object):
    a = 1

class MyTask(Task):
    def run(self):
        print self.context.a

# caller
from tasks import MyTask, MyContext

c = MyContext()
t = MyTask()
t.context = c
print t.context.a
# Shows 1

t.delay()
===== Worker output
Task tasks.MyTask[d30e1c37-d094-4809-9f72-89ff37b81a85]
raised exception: AttributeError("'NoneType' object has no attribute 'a'",)
It looks like this has been asked before here, but I do not see an answer.

This doesn't work because the instance that actually runs isn't the same instance on which you call the delay method. Every worker instantiates its own singleton for each task.
In short, celery isn't designed for the task objects to carry data. Data should be passed to the task through the delay or apply_async methods. If the context object is simple and can be pickled just pass it to delay. If it's complex, a better approach may be to pass a database id so that the task can retrieve it in the worker.
http://docs.celeryproject.org/en/latest/userguide/tasks.html#instantiation
Also, note that in celery 2.5 delay and apply_async were class methods.
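For example, here's a rough sketch of the recommended pattern, using the old-style @task decorator to match the question; load_record_from_db is a hypothetical helper for the database-id variant:
from celery.task import task  # old-style import; adjust to your celery version

@task
def process_context(a):
    # The data arrives as a plain task argument, so it must be serializable.
    print(a)

@task
def process_record(record_id):
    # Hypothetical helper: re-fetch the complex object inside the worker
    # instead of trying to send it through the broker.
    record = load_record_from_db(record_id)
    print(record.a)

# Caller side:
process_context.delay(1)               # simple, pickleable data
process_record.apply_async(args=[42])  # complex object: pass an id instead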

Related

How to use mock in function run in multiprocessing.Pool

In my code, I use multiprocessing.Pool to run some code concurrently. Simplified code looks somewhat like this:
from functools import partial
from multiprocessing import Pool

import requests
from requests import Session

class Wrapper():
    session: Session

    def __init__(self):
        self.session = requests.Session()
        # Session initialization

    def upload_documents(self, documents):
        with Pool(4) as pool:
            upload_file = partial(self.upload_document)
            pool.starmap(upload_file, documents)
        summary = create_summary(documents)
        self.upload_document(summary)

    def upload_document(self, doc):
        self.post(doc)

    def post(self, data):
        self.session.post(self.url, data, other_params)
So basically sending documents via HTTP is parallelized. Now I want to test this code, and can't do it. This is my test:
@patch.object(Session, 'post')
def test_study_upload(self, post_mock):
    response_mock = Mock()
    post_mock.return_value = response_mock
    response_mock.ok = True
    with Wrapper() as wrapper:
        wrapper.upload_documents(documents)
    mc = post_mock.mock_calls
And in debug I can check the mock calls. There is one that looks valid, and it's the one uploading the summary, and a bunch of calls like call.json(), call.__len__(), call.__str__() etc.
There are no calls uploading documents. When I set a breakpoint in the upload_document method, I can see it is called once for each document, so it works as expected. However, I can't verify this behavior through the mock. I assume it's because there are many processes calling the same mock, but still, how can I solve this?
I'm using Python 3.6.
The approach I would take here is to keep your test as granular as possible and mock out other calls. In this case you'd want to mock your Pool object and verify that it's calling what you're expecting, not actually rely on it to spin up child processes during your test. Here's what I'm thinking:
@patch('yourmodule.Pool')
def test_study_upload(self, mock_pool_init):
    mock_pool_instance = mock_pool_init.return_value.__enter__.return_value
    with Wrapper() as wrapper:
        wrapper.upload_documents(documents)
    # To get the upload_file arg here, you'll need to either mock the partial call,
    # or actually call it and get the return value.
    mock_pool_instance.starmap.assert_called_once_with(upload_file, documents)
Then you'd want to take your existing logic and test your upload_document function separately:
@patch.object(Session, 'post')
def test_upload_file(self, post_mock):
    response_mock = Mock()
    post_mock.return_value = response_mock
    response_mock.ok = True
    with Wrapper() as wrapper:
        wrapper.upload_document(document)
    mc = post_mock.mock_calls
This gives you coverage both on the function that creates and controls your pool, and on the function that the pool instance calls. One caveat: I didn't test this, and I'm leaving some of it for you to fill in, since it looks like an abbreviated version of the actual module in your original question.
EDIT:
Try this:
def test_study_upload(self):
    def call_direct(pool_self, func_var, documents):
        # Stand-in for Pool.starmap; pool_self receives the Pool instance.
        # Call the mapped function directly, unpacking each argument tuple
        # the way starmap would.
        return [func_var(*args) for args in documents]

    with patch('yourmodule.Pool.starmap', new=call_direct):
        with Wrapper() as wrapper:
            wrapper.upload_documents(documents)
This patches out the starmap call so that it calls the function you pass in directly. It circumvents the Pool entirely; the bottom line is that you can't really dive into the subprocesses created by multiprocessing.
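Building on that, here's a rough, untested sketch (the test name is illustrative; Wrapper and documents come from the question) that combines the starmap patch with the Session.post patch, so every post happens in the parent process and can be counted:
from unittest.mock import Mock, patch
from requests import Session

# inside the same TestCase as the tests above
@patch.object(Session, 'post')
def test_study_upload_posts_each_document(self, post_mock):
    post_mock.return_value = Mock(ok=True)

    def call_direct(pool_self, func, iterable):
        # Run the "parallel" work in this process so the mock records every call.
        return [func(*args) for args in iterable]

    with patch('yourmodule.Pool.starmap', new=call_direct):
        with Wrapper() as wrapper:
            wrapper.upload_documents(documents)

    # One post per document plus one for the summary (adjust to your data).
    assert post_mock.call_count == len(documents) + 1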

Celery: can a task be created from inner function?

I have a celery task:
@task
def foo():
    part1()
    part2()
    part3()
...that I'm breaking up into a chain of subtasks
@task
def foo():
    @task
    def part1():
        ...
    @task
    def part2():
        ...
    @task
    def part3():
        ...
    chain(part1.s(), part2.s(), part3.s()).delay()
The subtasks are inner functions because I don't want them executed outside the context of the parent task. The issue is that my worker does not detect and/or register the inner tasks (I am using autoregister to discover apps and task modules). If I move them out to the same level in the module as the parent task foo, it works fine.
Does celery support inner functions as tasks? If so, how do I get workers to register them?
The problem with your code is that you get a new definition of part1 every time you call foo(). Note also that no part1 function exists at all until foo is called, so it is impossible for celery to register any of the part functions when it initializes a worker.
I think the following code is the closest to what you want.
def make_maintask():
    @task
    def subtask1():
        print("do subtask")
    # ... define subtask2 and subtask3 the same way

    @task
    def _maintask():
        chain(subtask1.si(), subtask2.si(), subtask3.si()).delay()

    return _maintask

maintask = make_maintask()
This way, each subtask definition is not visible from the outside.
Some comments:
If all you want to do is hide the subtasks, please think twice. The designer of the Python language didn't believe that one needs access control such as public and private as in Java; it is a feature that severely complicates a language for a dubious advantage. I think well-organized packages and modules, and good names (such as a leading underscore), can solve all your problems.
If all _maintask does is delegate subtasks to other subprocesses, you don't really need to define it as a celery task. Don't make a celery task call another celery task unless you really need it.
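Here's a sketch of that flatter layout: module-level tasks whose leading underscores signal they're internal, chained by a plain function that only dispatches the work (names are illustrative; adjust the @task import to your celery version):
from celery import chain
from celery.task import task  # or however @task is imported in your project

@task
def _part1():
    ...

@task
def _part2():
    ...

@task
def _part3():
    ...

def run_foo():
    # A plain function is enough here: it only builds and dispatches the chain,
    # so it doesn't need to be a celery task itself.
    return chain(_part1.si(), _part2.si(), _part3.si()).delay()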

Python multiprocessing initialising the Pool

I have a large read-only object that I would like the child processes to use but unfortunately this object cannot be pickled. Given that it's read-only I thought about declaring it as a global and then using an initializing function in the Pool where I perform the necessary copying. My code is something like:
import copy
import multiprocessing

def f(processes, args):
    global pool
    pool = multiprocessing.Pool(processes, setGlobal, [args])

def setGlobal(args):
    # global object to be used by the child processes...
    global obj
    obj = copy.deepcopy(args)
The function setGlobal performs the initialization. My first question concerns the arguments to setGlobal (which are passed as a list). Do these need to be pickle-able? The errors I'm getting seem to suggest that they do. If so, how can I make the unpickle-able read-only object visible to my child processes?
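For comparison, here is a minimal sketch of relying on fork inheritance instead of initializer arguments; this only works with a fork start method (the Unix default), and the names are illustrative:
import multiprocessing

# Module-level global, set in the parent before the Pool is created.
obj = None

def work(i):
    # Each forked child inherits its own copy of obj; nothing is pickled here.
    return (i, obj is not None)

def main(big_unpicklable_object):
    global obj
    obj = big_unpicklable_object
    # With the fork start method the children inherit obj as-is.
    # With spawn this does not work: the module is re-imported and obj stays None,
    # so the object would have to be rebuilt inside a Pool initializer instead.
    with multiprocessing.Pool(4) as pool:
        print(pool.map(work, range(8)))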

Parallel processing loop using multiprocessing Pool

I want to process a large for loop in parallel, and from what I have read the best way to do this is to use the multiprocessing library that comes standard with Python.
I have a list of around 40,000 objects, and I want to process them in parallel in a separate class. The reason for doing this in a separate class is mainly because of what I read here.
In one class I have all the objects in a list and via the multiprocessing.Pool and Pool.map functions I want to carry out parallel computations for each object by making it go through another class and return a value.
# ... some class that generates the list_objects
pool = multiprocessing.Pool(4)
results = pool.map(Parallel, self.list_objects)
And then I have a class which I want to process each object passed by the pool.map function:
class Parallel(object):
    def __init__(self, args):
        self.some_variable = args[0]
        self.some_other_variable = args[1]
        self.yet_another_variable = args[2]
        self.result = None

    def __call__(self):
        self.result = self.calculate(self.some_variable)
The reason I have a __call__ method is the post I linked before, yet I'm not sure I'm using it correctly, as it seems to have no effect: I'm not getting the self.result value to be generated.
Any suggestions?
Thanks!
Use a plain function, not a class, when possible. Use a class only when there is a clear advantage to doing so.
If you really need to use a class, then given your setup, pass an instance of Parallel:
results = pool.map(Parallel(args), self.list_objects)
Since the instance has a __call__ method, the instance itself is callable, like a function.
By the way, the __call__ needs to accept an additional argument:
def __call__(self, val):
since pool.map is essentially going to do, in parallel, something like:
p = Parallel(args)
result = []
for val in self.list_objects:
    result.append(p(val))
Pool.map simply applies a function (actually, a callable) in parallel. It has no notion of objects or classes. Since you pass it a class, it simply calls __init__; __call__ is never executed. You need to either call it explicitly from __init__ or use pool.map(Parallel.__call__, preinitialized_objects).
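Putting the first suggestion together, here is a minimal runnable sketch of both approaches, with a plain function and a callable instance whose __call__ receives one mapped item; calculate stands in for the real work:
from multiprocessing import Pool

def calculate(x):
    # Stand-in for the real per-object computation.
    return x * x

# Approach 1: a plain function, which is usually all you need.
def process(obj):
    return calculate(obj)

# Approach 2: a callable instance, if you really need per-call configuration.
class Parallel(object):
    def __init__(self, factor):
        self.factor = factor

    def __call__(self, obj):
        # pool.map supplies one item of the iterable per call.
        return self.factor * calculate(obj)

if __name__ == '__main__':
    items = list(range(10))
    with Pool(4) as pool:
        print(pool.map(process, items))
        print(pool.map(Parallel(2), items))  # the instance is pickled to each worker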

Python queue module difficulty

I'm an experienced programmer, but completely new to Python.
I've resolved most difficulties, but I can't get the queue module to work.
Any help gratefully received. Python 3.2.
Reduced to its basic minimum, here's the issue:
>>> import queue
>>> q = queue.Queue
>>> q.qsize()
Traceback:
...
q.qsize()
...
TypeError: qsize() takes 1 argument exactly (0 given)
Documentation...
7.8.1. Queue Objects
Queue objects (Queue, LifoQueue, or PriorityQueue) provide the public methods described below.
Queue.qsize()
OK, so what argument?
You're not initializing an instance, you're just reassigning the class name to q. The "argument" that it's talking about is self, the explicit self-reference that all Python methods need. In other words, it's saying that you're trying to call an instance method with no instance.
>>> q = queue.Queue()
>>> q.qsize()
If you've never seen a Python method definition, it looks something like this:
class Queue(object):
    # Note the explicit 'self' argument
    def qsize(self):
        # ...
You are simply giving queue.Queue another name; you never instantiate an object.
Try this:
q = queue.Queue()
print(q.qsize())
