django-celery redis memoryerror - python

I'm using django+celery with Redis as the broker, and one of my tasks involves reading a large file (about 25 MB) and returning the results, to which another task is chained to process those results.
I'm encountering the error below. Due to my lack of familiarity with Redis, I'm appealing for help: what might be the problem?
[2013-06-23 22:45:41,241: ERROR/MainProcess] Unrecoverable error: MemoryError()
Traceback (most recent call last):
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/celery/worker/__init__.py", line 363, in start
component.start()
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/celery/worker/consumer.py", line 395, in start
self.consume_messages()
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/celery/worker/consumer.py", line 480, in consume_messages
readers[fileno](fileno, event)
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/kombu/transport/redis.py", line 770, in handle_event
self._callbacks[queue](message)
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/kombu/transport/virtual/__init__.py", line 479, in _callback
self.qos.append(message, message.delivery_tag)
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/kombu/transport/redis.py", line 117, in append
dumps([message._raw, EX, RK])) \
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/redis/client.py", line 1808, in execute
return execute(conn, stack, raise_on_error)
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/redis/client.py", line 1705, in _execute_transaction
[args for args, options in cmds]))
MemoryError
Not sure if it provides any hint, but the maxmemory setting on the Redis server doesn't seem to be the problem.
redis 127.0.0.1:6379> config get maxmemory
1) "maxmemory"
2) "3758096384"

It looks like the memory error is not on the Redis side but on the client side (the Celery worker).
My guess is that the worker is running out of memory.
You should make sure the Celery process can actually fit the result coming from Redis into memory.
If this happens after a few tasks have executed, it means you don't have enough memory to handle the concurrency you've set, or that you are leaking memory (keeping references to the objects received from Redis) somewhere.
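One way to avoid pushing the 25 MB result through the broker at all is to have the first task spill its output to disk (or shared storage) and hand only a reference to the chained task. This is a minimal sketch of that idea, not your actual code: the task names, the expensive_parse helper, and the decorator style are assumptions, so adapt them to your Celery version and project layout.
import json
import tempfile
from celery import shared_task, chain

@shared_task
def parse_large_file(path):
    results = expensive_parse(path)  # assumption: your existing parsing logic
    out = tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False)
    json.dump(results, out)          # spill the large result to disk instead of the broker
    out.close()
    return out.name                  # only the file name travels through Redis

@shared_task
def process_results(results_path):
    with open(results_path) as fh:
        results = json.load(fh)
    ...                              # downstream processing of the results

chain(parse_large_file.s('/path/to/big_input'), process_results.s()).delay()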

Related

Is multiprocessing.Pool not allowed in Airflow task? - AssertionError: daemonic processes are not allowed to have children

Our airflow project has a task that queries from BigQuery and uses Pool to dump in parallel to local JSON files:
from functools import partial
from multiprocessing import Pool

def dump_in_parallel(table_name):
    base_query = f"select * from models.{table_name}"
    all_conf_ids = range(1, 10)
    n_jobs = 4
    with Pool(n_jobs) as p:
        # dump_conf_id (defined elsewhere) writes /tmp/output_file_<i>.json for each id
        p.map(partial(dump_conf_id, base_query=base_query), all_conf_ids)
    with open("/tmp/final_output.json", "wb") as f:
        filenames = [f'/tmp/output_file_{i}.json' for i in all_conf_ids]
        ...  # merging of the per-id files into f elided
This task was working fine for us in airflow v1.10, but is no longer working in v2.1+. Section 2.1 here - https://blog.mbedded.ninja/programming/languages/python/python-multiprocessing/ - mentions "If you try and create a Pool from within a child worker that was already created with a Pool, you will run into the error: daemonic processes are not allowed to have children"
Here is the full Airflow error:
[2021-08-22 02:11:53,064] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 150, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 161, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/plugins/tasks/bigquery.py", line 249, in dump_in_parallel
with Pool(n_jobs) as p:
File "/usr/local/lib/python3.7/multiprocessing/context.py", line 119, in Pool
context=self.get_context())
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
self._repopulate_pool()
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
w.start()
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 110, in start
'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
If it matters, we run airflow using the LocalExecutor. Any idea why this task that uses Pool would have been working in airflow v1.10 but no longer in airflow 2.1?
Airflow 2 uses a different processing model under the hood to speed up processing while maintaining process-based isolation between running tasks.
That's why it uses forking and multiprocessing under the hood to run tasks, but this also means that if you use multiprocessing yourself, you hit a limitation of Python multiprocessing: daemonic processes are not allowed to spawn children of their own.
I am not 100% sure it will work, but you might try setting the execute_tasks_new_python_interpreter configuration option to True: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#execute-tasks-new-python-interpreter . This setting makes Airflow start a new Python interpreter when running a task instead of forking/using multiprocessing (though I am not 100% sure of the latter). Running your task will be quite a bit slower (up to a few seconds of overhead) because the new Python interpreter has to reinitialize and import all the Airflow code before running it.
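For reference, a sketch of how that option might be set; I'm assuming it lives in the [core] section as per the linked configuration reference:
# airflow.cfg
[core]
execute_tasks_new_python_interpreter = True

# or equivalently via an environment variable
AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True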
If that does not work, you can launch your multiprocessing job using the PythonVirtualenvOperator - that one will launch a new Python interpreter to run your Python code, and you should be able to use multiprocessing.
Replacing the multiprocessing library with the billiard library works, per https://github.com/celery/celery/issues/4525. We have no idea why subbing one library in for the other resolves this issue, though...
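A minimal sketch of that swap, assuming billiard's Pool behaves as a drop-in replacement for multiprocessing.Pool and that the rest of dump_in_parallel stays unchanged:
from functools import partial
from billiard import Pool  # instead of: from multiprocessing import Pool

def dump_in_parallel(table_name):
    base_query = f"select * from models.{table_name}"
    all_conf_ids = range(1, 10)
    with Pool(4) as p:
        # dump_conf_id is the same helper used in the original task
        p.map(partial(dump_conf_id, base_query=base_query), all_conf_ids)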
You can also switch to the joblib Python library with a lock backend and kill the daemonic processes after execution, following the instructions it provides.

GUnicorn and shared dictionary on REST API: "Ran out of input" Error on high load

I am using a Manager.dict to synchronize some data between multiple workers of an API served with GUnicorn (with Meinheld workers). While this works fine for a few concurrent queries, it breaks when I fire about 100 queries simultaneously at the API, and I get the following stack trace:
2020-07-16 12:35:38,972-app.api.my_resource-ERROR-140298393573184-on_post-175-Ran out of input
Traceback (most recent call last):
File "/app/api/my_resource.py", line 163, in on_post
results = self.do_something(a, b, c, **d)
File "/app/user_data/data_lookup.py", line 39, in lookup_something
return (a in self._shared_dict
File "<string>", line 2, in __contains__
File "/usr/local/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
kind, result = conn.recv()
File "/usr/local/lib/python3.6/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
EOFError: Ran out of input
2020-07-16 12:35:38,972-app.api.my_resource-ERROR-140298393573184-on_post-175-unpickling stack underflow
Traceback (most recent call last):
File "/app/api/my_resource.py", line 163, in on_post
results = self.do_something(a, b, c, **d)
File "/app/user_data/data_lookup.py", line 39, in lookup_something
return (a in self._shared_dict
File "<string>", line 2, in __contains__
File "/usr/local/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
kind, result = conn.recv()
File "/usr/local/lib/python3.6/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
_pickle.UnpicklingError: unpickling stack underflow
My API framework is Falcon. I have a dictionary containing user data that can be updated via POST requests. The architecture should be simple, so I chose Manager.dict() from the multiprocessing package to store the data. When handling other queries, some input is checked against the contents of this dictionary (if a in self._shared_dict: ...). This is where the above-mentioned errors occur.
Why is this problem happening? It seems to be tied to the Manager.dict. Besides, when I debug in PyCharm, the debugger often fails to evaluate any variables and just hangs indefinitely somewhere in multiprocessing code, waiting for data.
It seems to have something to do with the Meinheld workers. When I configure GUnicorn to use the default sync worker class, this error no longer occurs. Hence, Python multiprocessing and the Meinheld package seem not to work well together in my setting.
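For illustration, the two invocations being compared might look roughly like this (the app module path "myapp.app:api" is a placeholder, and the meinheld worker-class string is taken from meinheld's usual gunicorn integration):
# combination that triggered the errors: meinheld workers
gunicorn myapp.app:api --workers 4 --worker-class "egg:meinheld#gunicorn_worker"

# working combination: default sync workers
gunicorn myapp.app:api --workers 4 --worker-class sync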

"Dictionary size changed during iteration" from Pebble ProcessPool

We have some parallel processing code built around Pebble; it's been working robustly for quite some time, but we seem to have run into an odd edge case.
Based on the exception trace (and the rock-simple code feeding it), I suspect it's actually a bug in Pebble, but who knows.
The code feeding the process pool is pretty trivial:
from pebble import ProcessPool

pool = ProcessPool(max_workers=10, max_tasks=10)
for path in filepaths:
    try:
        future = pool.schedule(function=self.analyse_file, args=(path,), timeout=30)
        future.add_done_callback(self.process_result)
    except Exception as e:
        print("Exception fired: " + str(e))  # NOT where the exception is firing
pool.close()
pool.join()
So in essence, we schedule a bunch of stuff to run, close out the pool, then wait for the pool to complete the scheduled tasks. NOTE: the exception is not being thrown in the schedule loop; it gets fired AFTER we call join().
This is the exception stack trace:
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 150, in task_scheduler_loop
pool_manager.schedule(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 198, in schedule
self.worker_manager.dispatch(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 327, in dispatch
self.pool_channel.send(WorkerTask(task.id, task.payload))
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/channel.py", line 66, in send
return self.writer.send(obj)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
RuntimeError: dictionary changed size during iteration
I think it's got to be some weird race condition, as the code will work flawlessly on some datasets but fail at what appears to be a random point on another dataset.
We were using Pebble 4.3.1 when we first ran into the issue (the same version we'd had since the beginning); upgrading to 4.5.0 made no change.
Has anybody run into similar issues with Pebble in the past? If so what was your fix?

Django: running a few tasks in parallel

I have a web application built with Django version 2.0.1.
A user uploads a file and, based on its content, there are tasks which are executed in a serial fashion. After execution, the results are shown to the user. Some of the tasks are independent of each other.
I want to execute the independent tasks in parallel. I tried using multiprocessing within views.py, but some errors are thrown when the processes are spawned. These tasks analyse some information and write to a file. The files are then combined to show the results to the user.
These tasks cannot be run asynchronously, as the results produced need to be shown to the waiting user. So I have dropped the idea of using Celery, which was recommended in other discussions.
Any suggestions would be helpful.
Thanks
Error:
This was the error we got:
Traceback (most recent call last):
C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\multiprocessing\spawn.py", line 106, in spawn_main
exitcode = _main(fd)
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\multiprocessing\spawn.py", line 116, in _main
self = pickle.load(from_parent)
File "G:\work\gitrepo\suprath-github\smartdata\ssd\FinalPlots\uploads\core\views.py", line 6, in
from uploads.core.models import Document
File "G:\work\gitrepo\suprath-github\smartdata\ssd\FinalPlots\uploads\core\models.py", line 7, in
class Document(models.Model):
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\site-packages\django\db\models\base.py", line 100, in new
app_config = apps.get_containing_app_config(module)
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\site-packages\django\apps\registry.py", line 244, in get_containing_app_config
self.check_apps_ready()
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\site-packages\django\apps\registry.py", line 127, in check_apps_ready
raise AppRegistryNotReady("Apps aren't loaded yet.")
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
"These tasks cannot be done asynchronously as the results produced need to be shown to the waiting user"
That doesn't mean you can't use an async queue (Celery or other). We have a very similar use case and do use Celery to run the tasks. The tasks (part parallel, part serial) store their progress in Redis, and the frontend polls to get the current state and display progress to the user; when the whole process is done (either successfully or not), we display the result (or errors).
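A rough sketch of that pattern, purely for illustration; the task name, the progress keys, and the independent_steps helper are all made-up placeholders:
import json
import redis
from celery import shared_task
from django.http import JsonResponse

r = redis.Redis()

@shared_task
def analyse_upload(upload_id):
    r.set(f"progress:{upload_id}", json.dumps({"state": "running", "done": 0}))
    for i, step in enumerate(independent_steps(upload_id)):  # assumption: your per-file steps
        step()
        r.set(f"progress:{upload_id}", json.dumps({"state": "running", "done": i + 1}))
    r.set(f"progress:{upload_id}", json.dumps({"state": "finished"}))

def progress_view(request, upload_id):
    # the frontend polls this endpoint until state == "finished", then fetches the result
    raw = r.get(f"progress:{upload_id}") or b'{"state": "pending"}'
    return JsonResponse(json.loads(raw))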
I agree with the solution provided by @bruno desthuillieres; however, you can also implement a socket-based solution to push results back to the user.
Since polling from the user side may have a significant performance impact, a socket solution would be ideal for this case.

passing file instance as an argument to the celery task raises "ValueError: I/O operation on closed file"

I need to pass a file as an argument to a celery task, but the passed file somehow arrives there closed. This only happens when I execute the task asynchronously.
Is this expected behavior?
views:
from engine.tasks import s3_upload_handler

def myfunc():
    f = open('/app/uploads/pic.jpg', 'rb')
    s3_upload_handler.apply_async(kwargs={"uploaded_file": f, "file_name": "test.jpg"})
tasks:
def s3_upload_handler(uploaded_file, file_name):
    ...
    # some code for uploading to s3
traceback:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 437, in __protected_call__
return self.run(*args, **kwargs)
File "/app/photohosting/engine/tasks.py", line 34, in s3_upload_handler
key.set_contents_from_file(uploaded_file)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1217, in set_contents_from_file
spos = fp.tell()
ValueError: I/O operation on closed file
flower logs:
kwargs {
'file_name': 'test.jpg',
'uploaded_file': <closed file '<uninitialized file>',
mode '<uninitialized file>' at 0x7f6ab9e75e40>
}
Yes, of course the file arrives there closed. Asynchronous Celery tasks run in a completely separate process (moreover, they can even run on a different machine), and there is no way to pass an open file to them.
You should close the file in the process from which you call the task, then pass its name (and maybe the position within the file, if you need it) to the task, and reopen it inside the task.
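A minimal sketch of that approach, keeping the names from the question; the file_path keyword argument is an assumption, and the actual S3 upload is left as it was in the original task:
# views
from engine.tasks import s3_upload_handler

def myfunc():
    path = '/app/uploads/pic.jpg'
    # pass only the path; the worker opens the file itself
    s3_upload_handler.apply_async(kwargs={"file_path": path, "file_name": "test.jpg"})

# tasks
def s3_upload_handler(file_path, file_name):
    with open(file_path, 'rb') as fh:
        ...  # upload fh to S3 exactly as before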
Another way of doing this would be to open the file and get a binary blob which you transfer over the wire. Of course, if the file is really large, then what @Vasily says is better, but it won't work if the worker runs on a different machine (unless your file is on shared storage).
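A sketch of the binary-blob variant, base64-encoding the contents so the default JSON serializer can carry them; the file_blob keyword argument is an assumption, and this only makes sense for small files:
import base64
from engine.tasks import s3_upload_handler

def myfunc():
    with open('/app/uploads/pic.jpg', 'rb') as f:
        blob = base64.b64encode(f.read()).decode('ascii')
    # the task would base64-decode file_blob before uploading it
    s3_upload_handler.apply_async(kwargs={"file_blob": blob, "file_name": "test.jpg"})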
