Django: running a few tasks in parallel

I have a web application built with Django 2.0.1.
A user uploads a file and, based on its content, a number of tasks are executed serially; after execution the results are shown to the user. Some of the tasks are independent of each other.
I want to execute the independent tasks in parallel. I tried using multiprocessing inside views.py, but errors are thrown when the processes are spawned. These tasks analyse some information and write to a file; the files are then combined to show the results to the user.
These tasks cannot be run asynchronously, because the results they produce need to be shown to the user who is waiting, so I have dropped the idea of using Celery as recommended in other discussions.
Any suggestions would be helpful.
Thanks
This was the error we got:
Traceback (most recent call last):
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\multiprocessing\spawn.py", line 106, in spawn_main
exitcode = _main(fd)
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\multiprocessing\spawn.py", line 116, in _main
self = pickle.load(from_parent)
File "G:\work\gitrepo\suprath-github\smartdata\ssd\FinalPlots\uploads\core\views.py", line 6, in
from uploads.core.models import Document
File "G:\work\gitrepo\suprath-github\smartdata\ssd\FinalPlots\uploads\core\models.py", line 7, in
class Document(models.Model):
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\site-packages\django\db\models\base.py", line 100, in new
app_config = apps.get_containing_app_config(module)
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\site-packages\django\apps\registry.py", line 244, in get_containing_app_config
self.check_apps_ready()
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\site-packages\django\apps\registry.py", line 127, in check_apps_ready
raise AppRegistryNotReady("Apps aren't loaded yet.")
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.

"These tasks cannot be done asynchronous as the results produced needs to be shown to the user waiting"
That doesn't mean you can't use an async queue (Celery or other). We have a very similar use case and do use Celery to run the tasks. The tasks (part parallel, part serial) store their progress in Redis, and the frontend polls to get the current state and display progress to the user; when the whole process is done (successfully or not) we display the result (or the errors).
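We roughly do something like the sketch below (illustrative only: process_file, task_progress and analysis_steps are made-up names, and the Redis connection defaults are assumptions):

# tasks.py -- a sketch of a Celery task recording its progress in Redis
import json
import redis
from celery import shared_task

r = redis.Redis()  # assumed connection defaults

@shared_task
def process_file(job_id, path):
    steps = analysis_steps(path)  # made-up helper returning callables
    r.set(job_id, json.dumps({"state": "running", "done": 0, "total": len(steps)}))
    for i, step in enumerate(steps, start=1):
        step()
        r.set(job_id, json.dumps({"state": "running", "done": i, "total": len(steps)}))
    r.set(job_id, json.dumps({"state": "finished"}))

# views.py -- the frontend polls this endpoint and renders the progress
from django.http import JsonResponse

def task_progress(request, job_id):
    raw = r.get(job_id)
    return JsonResponse(json.loads(raw) if raw else {"state": "unknown"})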

I agree with the solution provided by @bruno desthuilliers; however, you can also implement a socket solution to push results back to the user.
Since polling by the user can have a significant performance impact, a socket solution would be ideal for this case.
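For what it's worth, a minimal sketch of such a socket solution using Django Channels (just one option; the consumer, group name and URL kwargs are invented for illustration):

# consumers.py -- assumes Django Channels is installed and routed
import json
from channels.generic.websocket import AsyncWebsocketConsumer

class ProgressConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        # join a per-job group so tasks can push updates to this socket
        self.job_id = self.scope["url_route"]["kwargs"]["job_id"]
        await self.channel_layer.group_add(self.job_id, self.channel_name)
        await self.accept()

    async def disconnect(self, code):
        await self.channel_layer.group_discard(self.job_id, self.channel_name)

    # invoked by channel_layer.group_send(job_id, {"type": "progress", ...})
    async def progress(self, event):
        await self.send(text_data=json.dumps(event))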

Related

Is multiprocessing.Pool not allowed in Airflow task? - AssertionError: daemonic processes are not allowed to have children

Our Airflow project has a task that queries BigQuery and uses Pool to dump the results in parallel to local JSON files:
from functools import partial
from multiprocessing import Pool

def dump_in_parallel(table_name):
    base_query = f"select * from models.{table_name}"
    all_conf_ids = range(1, 10)
    n_jobs = 4
    with Pool(n_jobs) as p:
        p.map(partial(dump_conf_id, base_query=base_query), all_conf_ids)
    with open("/tmp/final_output.json", "wb") as f:
        filenames = [f'/tmp/output_file_{i}.json' for i in all_conf_ids]
        # (the rest of the function, which combines the per-id files, is truncated here)
This task was working fine for us in airflow v1.10, but is no longer working in v2.1+. Section 2.1 here - https://blog.mbedded.ninja/programming/languages/python/python-multiprocessing/ - mentions "If you try and create a Pool from within a child worker that was already created with a Pool, you will run into the error: daemonic processes are not allowed to have children"
Here is the full Airflow error:
[2021-08-22 02:11:53,064] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 150, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 161, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/plugins/tasks/bigquery.py", line 249, in dump_in_parallel
with Pool(n_jobs) as p:
File "/usr/local/lib/python3.7/multiprocessing/context.py", line 119, in Pool
context=self.get_context())
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
self._repopulate_pool()
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
w.start()
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 110, in start
'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
If it matters, we run airflow using the LocalExecutor. Any idea why this task that uses Pool would have been working in airflow v1.10 but no longer in airflow 2.1?
Airflow 2 uses a different processing model under the hood to speed up processing, while maintaining process-based isolation between running tasks.
That's why it uses forking and multiprocessing under the hood to run tasks. But this also means that if you use multiprocessing yourself, you will hit the limit of Python multiprocessing that does not allow daemonic processes to have children, i.e. you cannot nest multiprocessing.
I am not 100% sure it will work, but you might try setting the execute_tasks_new_python_interpreter configuration option to True: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#execute-tasks-new-python-interpreter . This setting will cause Airflow to start a new Python interpreter when running a task instead of forking/using multiprocessing (though I am not 100% sure of the latter). Running your tasks will be somewhat slower (up to a few seconds of overhead per task), because the new Python interpreter has to reinitialize and import all the Airflow code before running your task.
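Assuming the option sits in the [core] section as the linked reference suggests, it would be set like this (the environment-variable form follows Airflow's standard AIRFLOW__SECTION__KEY naming):

# airflow.cfg
[core]
execute_tasks_new_python_interpreter = True

# or, equivalently, via the environment
# AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True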
If that does not work, then you can launch your multiprocessing job using the PythonVirtualenvOperator - that one will launch a new Python interpreter to run your Python code, and you should be able to use multiprocessing.
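A rough sketch of that route (task id and the callable body are illustrative; note the imports must live inside the callable, since it runs in a separate interpreter):

from airflow.operators.python import PythonVirtualenvOperator

def dump_in_parallel_isolated():
    # runs in a fresh interpreter, so multiprocessing can spawn children
    from functools import partial
    from multiprocessing import Pool
    ...  # the Pool-based dump logic from above

dump_task = PythonVirtualenvOperator(
    task_id="dump_in_parallel",
    python_callable=dump_in_parallel_isolated,
    system_site_packages=True,  # assumption: reuse the already-installed packages
)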
Replacing the multiprocessing library with the billiard library works, per https://github.com/celery/celery/issues/4525. We have no idea why substituting one library for the other resolves the issue, though...
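The substitution itself is a one-line import change, since billiard (the multiprocessing fork maintained by the Celery project) keeps the same Pool API; here mirrored against the question's code:

# from multiprocessing import Pool   # old import
from billiard import Pool            # drop-in replacement with the same API
from functools import partial

with Pool(4) as p:
    p.map(partial(dump_conf_id, base_query=base_query), all_conf_ids)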
You can also switch to the joblib Python library with the loky backend, which kills its daemonic worker processes after execution.
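A minimal sketch of the joblib variant (work is a placeholder for your own function):

from joblib import Parallel, delayed

def work(i):
    return i * i

# loky starts its own worker processes and tears them down when done
results = Parallel(n_jobs=4, backend="loky")(delayed(work)(i) for i in range(10))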

GUnicorn and shared dictionary on REST API: "Ran out of input" Error on high load

I am using a manager.dict to synchronize some data between multiple workers of an API served with GUnicorn (with Meinheld workers). While this works fine for a few concurrent queries, it breaks when I fire about 100 queries simultaneously at the API, and I get the following stack trace:
2020-07-16 12:35:38,972-app.api.my_resource-ERROR-140298393573184-on_post-175-Ran out of input
Traceback (most recent call last):
File "/app/api/my_resource.py", line 163, in on_post
results = self.do_something(a, b, c, **d)
File "/app/user_data/data_lookup.py", line 39, in lookup_something
return (a in self._shared_dict
File "<string>", line 2, in __contains__
File "/usr/local/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
kind, result = conn.recv()
File "/usr/local/lib/python3.6/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
EOFError: Ran out of input
2020-07-16 12:35:38,972-app.api.my_resource-ERROR-140298393573184-on_post-175-unpickling stack underflow
Traceback (most recent call last):
File "/app/api/my_resource.py", line 163, in on_post
results = self.do_something(a, b, c, **d)
File "/app/user_data/data_lookup.py", line 39, in lookup_something
return (a in self._shared_dict
File "<string>", line 2, in __contains__
File "/usr/local/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
kind, result = conn.recv()
File "/usr/local/lib/python3.6/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
_pickle.UnpicklingError: unpickling stack underflow
My API framework is Falcon. I have a dictionary containing user data that can be updated via POST requests. The architecture should be simple, so I chose Manager.dict() from the multiprocessing package to store the data. When handling other queries, some input is checked against the contents of this dictionary (if a in self._shared_dict: ...). This is where the above-mentioned errors occur.
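For context, a stripped-down sketch of the setup described above (module and class names are invented; the real code is more involved):

# shared_state.py -- created once at import time, inherited by forked workers
from multiprocessing import Manager

_manager = Manager()           # starts a separate manager server process
shared_dict = _manager.dict()  # a proxy: every access is a socket round-trip

# data_lookup.py
class DataLookup:
    def __init__(self, shared_dict):
        self._shared_dict = shared_dict

    def lookup_something(self, a):
        # this membership test talks to the manager process over a connection,
        # which is where the EOFError/UnpicklingError surfaces under load
        return a in self._shared_dict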
Why is this happening? It seems to be tied to the manager.dict. Moreover, when I debug in PyCharm, the debugger often fails to evaluate any variables and just hangs indefinitely somewhere in multiprocessing code, waiting for data.
It seems to have something to do with the Meinheld workers. When I configure GUnicorn to use the default sync worker class, the error no longer occurs. Hence, Python multiprocessing and the Meinheld package do not seem to work well together in my setting.
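For reference, the worker class is selected on the gunicorn command line (app:api is a placeholder for the actual WSGI entry point):

# before: meinheld workers, which trigger the error under load
# gunicorn --workers 4 --worker-class meinheld.gmeinheld.MeinheldWorker app:api
# after: default sync workers, with which the error no longer occurs
gunicorn --workers 4 --worker-class sync app:api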

Using Dask LocalCluster() within a modular python codebase

I am trying to use Dask Distributed's LocalCluster to run code in parallel using all the cores of a single machine.
Consider a sample python data pipeline, with the folder structure below.
sample_dask_program
├── main.py
├── parallel_process_1.py
├── parallel_process_2.py
├── process_1.py
├── process_2.py
└── process_3.py
main.py is the entry point, which executes the whole pipeline sequentially.
Eg:
def run_pipeline():
    stage_one_run_util()
    stage_two_run_util()
    ...
    stage_six_run_util()

if __name__ == '__main__':
    ...
    run_pipeline()
parallel_process_1.py and parallel_process_2.py are modules which create a Client() and use futures to achieve parallelism.
with Client() as client:
    # list to store futures after they are submitted
    futures = []
    for item in items:
        future = client.submit(
            ...
        )
        futures.append(future)
    results = client.gather(futures)
process_1.py, process_2.py and process_3.py are modules which do simple computations that need not be run in parallel using all the CPU cores.
Traceback:
File "/sm/src/calculation/parallel.py", line 140, in convert_qty_to_float
results = client.gather(futures)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/client.py", line 1894, in gather
asynchronous=asynchronous,
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/client.py", line 778, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/utils.py", line 348, in sync
raise exc.with_traceback(tb)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/utils.py", line 332, in f
result[0] = yield future
File "/home/iouser/.local/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
concurrent.futures._base.CancelledError
This is the error thrown by the workers:
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:33901 -> tcp://127.0.0.1:38821
Traceback (most recent call last):
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 248, in write
future = stream.write(frame)
File "/home/iouser/.local/lib/python3.7/site-packages/tornado/iostream.py", line 546, in write
self._check_closed()
File "/home/iouser/.local/lib/python3.7/site-packages/tornado/iostream.py", line 1035, in _check_closed
raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/worker.py", line 1248, in get_data
compressed = await comm.write(msg, serializers=serializers)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 255, in write
convert_stream_closed_error(self, e)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: BrokenPipeError: [Errno 32] Broken pipe
I am not able to reproduce this error locally or find a minimal reproducible example, as the occurrence of this error is abrupt.
Is this the right way to use Dask LocalCluster in a modular Python program?
EDIT
I have observed that these errors come up when the LocalCluster is created with a relatively high number of threads and processes. I am doing computations which use NumPy and pandas, and this is not a good practice, as described here.
At times, when the LocalCluster is created with 4 workers and 16 processes, no error gets thrown. When it is created with 8 workers and 40 processes, the error I described above gets thrown.
As far as I understand, Dask randomly selects this combination (is this an issue with Dask?), as I tested on the same AWS Batch instance (with 8 cores (16 vCPUs)).
The issue does not pop up when I forcefully create the cluster with only threads.
Eg:
cluster = LocalCluster(processes=False)
with Client(cluster) as client:
    client.submit(...)
    ...
But creating the LocalCluster with only threads slows the execution down by 2-3 times.
So, is the solution to the problem finding the right number of processes/threads suitable for the program?
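One way to take the guesswork out is to size the cluster explicitly instead of letting Dask pick the combination; the numbers below are just an example for an 8-core box, and do_work/items are placeholders:

from dask.distributed import Client, LocalCluster

# pin workers and threads explicitly; for NumPy/pandas-heavy tasks, a few
# processes with a small number of threads each is a common starting point
cluster = LocalCluster(n_workers=8, threads_per_worker=2)
with Client(cluster) as client:
    futures = [client.submit(do_work, item) for item in items]
    results = client.gather(futures)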
It is more common to create a Dask Client once, and then run many workloads on it.
with Client() as client:
    stage_one(client)
    stage_two(client)
That being said, what you're doing should be fine. If you're able to reproduce the error with a minimal example, that would be useful (but no expectations).

django-celery redis memoryerror

I'm using django+celery with Redis as the broker, and one of my tasks involves reading a large file, about 25MB in size, and returning the results, to which another task is chained to process those results.
I'm encountering the error below and, due to my lack of familiarity with Redis, I'm appealing for help. What might be the problem?
[2013-06-23 22:45:41,241: ERROR/MainProcess] Unrecoverable error: MemoryError()
Traceback (most recent call last):
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/celery/worker/__init__.py", line 363, in start
component.start()
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/celery/worker/consumer.py", line 395, in start
self.consume_messages()
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/celery/worker/consumer.py", line 480, in consume_messages
readers[fileno](fileno, event)
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/kombu/transport/redis.py", line 770, in handle_event
self._callbacks[queue](message)
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/kombu/transport/virtual/__init__.py", line 479, in _callback
self.qos.append(message, message.delivery_tag)
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/kombu/transport/redis.py", line 117, in append
dumps([message._raw, EX, RK])) \
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/redis/client.py", line 1808, in execute
return execute(conn, stack, raise_on_error)
File "/home/property/virtualenv/property_env/lib/python2.6/site-packages/redis/client.py", line 1705, in _execute_transaction
[args for args, options in cmds]))
MemoryError
Not sure if it provides any hint, but the maxmemory setting on the Redis server doesn't seem to be the problem:
redis 127.0.0.1:6379> config get maxmemory
1) "maxmemory"
2) "3758096384"
It looks like the memory error is not on the Redis side but on the client side (the Celery worker).
My guess is that the worker is running out of memory.
You should make sure the Celery process can actually fit the result coming from Redis into memory.
If this happens after a few tasks have executed, it means you don't have enough memory to handle the concurrency you set, or that you are leaking memory (keeping references to the objects read from Redis) somewhere.
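One pattern that helps in practice (my addition, not part of the answer above) is to keep the 25MB payload out of the broker entirely by passing a file path between the chained tasks; the task names and transform helper here are hypothetical:

from celery import shared_task, chain

@shared_task
def parse_big_file(path):
    out_path = path + ".parsed"
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(transform(line))  # 'transform' stands in for the real logic
    return out_path  # only this small string travels through Redis

@shared_task
def process_results(out_path):
    with open(out_path) as f:
        ...  # downstream processing reads from disk, not from the broker

chain(parse_big_file.s("/data/upload.csv"), process_results.s())()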

Django exception bugging me, don't know how to debug it

I recently upgraded to Python 2.7 and Django 1.3, and since then I keep getting this:
Unhandled exception in thread started by <bound method Command.inner_run of <django.core.management.commands.runserver.Command object at 0x109c57490>>
Traceback (most recent call last):
File "/Users/ApPeL/.virtualenvs/myhunt/lib/python2.7/site-packages/django/core/management/commands/runserver.py", line 88, in inner_run
self.validate(display_num_errors=True)
File "/Users/ApPeL/.virtualenvs/myhunt/lib/python2.7/site-packages/django/core/management/base.py", line 249, in validate
num_errors = get_validation_errors(s, app)
File "/Users/ApPeL/.virtualenvs/myhunt/lib/python2.7/site-packages/django/core/management/validation.py", line 36, in get_validation_errors
for (app_name, error) in get_app_errors().items():
File "/Users/ApPeL/.virtualenvs/myhunt/lib/python2.7/site-packages/django/db/models/loading.py", line 146, in get_app_errors
self._populate()
File "/Users/ApPeL/.virtualenvs/myhunt/lib/python2.7/site-packages/django/db/models/loading.py", line 67, in _populate
self.write_lock.release()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 137, in release
raise RuntimeError("cannot release un-acquired lock")
RuntimeError: cannot release un-acquired lock
Your help would be greatly appreciated.
A usual first recommendation is to apply the latest updates to gevent or greenlet or whatever you use related to threads. The implementation of threading.Thread.start changed between Python 2.6 and 2.7. There are many recipes for how to start gevent or greenlets with Django; try a recent one written for Python 2.7, and post a link to whichever one triggers the problem.
Debugging:
Add the following lines to your manage.py to enable logging of thread starts etc. to stderr:
import threading
setattr(threading, '__debug__', True)
Add the verbose argument to django/db/models/loading.py line 39 in order to also see which threads acquire and release the lock:
- write_lock = threading.RLock(),
+ write_lock = threading.RLock(verbose=True),
Run the development server. With only one thread and no autoreload you should see something like:
$ python manage.py runserver --noreload
Validating models...
MainThread: <_RLock owner='MainThread' count=1>.acquire(1): initial success
MainThread: <_RLock owner=None count=0>.release(): final release
Notes:
count=1 acquire(1) -- the first acquire by a blocking lock
owner=None count=0>.release() -- the lock is currently being unlocked
$ python manage.py runserver
Validating models...
Dummy-1: <_RLock owner=-1222960272 count=1>.acquire(1): initial success
Dummy-1: <_RLock owner=None count=0>.release(): final release
This is the same with autoreload. Models are validated by the child process.
"Dummy-1" is a symbolic name of the thread. This can be repeated for more threads, but no threads should/can acquire the lock until it is released by the previous thread. We can continue according the results.
