Using Dask LocalCluster() within a modular python codebase

Using Dask LocalCluster() within a modular python codebase - python

I am trying to use Dask Distributed's LocalCluster to run code in parallel using all the cores of a single machine.
Consider a sample python data pipeline, with the folder structure below.
sample_dask_program
├── main.py
├── parallel_process_1.py
├── parallel_process_2.py
├── process_1.py
├── process_2.py
└── process_3.py
main.py is the entry point, which executes while pipeline sequentially.
Eg:
def run_pipeline():
stage_one_run_util()
stage_two_run_util()
...
stage_six_run_util()
if __name__ == '__main__':
...
run_pipeline()
parallel_process_1.py and parallel_process_2.py are modules which create a Client() and use futures to achieve parallelism.
with Client() as client:
# list to store futures after they are submitted
futures = []
for item in items:
future = client.submit(
...
)
futures.append(future)
results = client.gather(futures)
process_1.py, process_2.py and process_3.py are modules which do simple computation that need not be run in parallel using all the CPU cores.
Traceback:
File "/sm/src/calculation/parallel.py", line 140, in convert_qty_to_float
results = client.gather(futures)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/client.py", line 1894, in gather
asynchronous=asynchronous,
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/client.py", line 778, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/utils.py", line 348, in sync
raise exc.with_traceback(tb)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/utils.py", line 332, in f
result[0] = yield future
File "/home/iouser/.local/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
concurrent.futures._base.CancelledError
This is the error thrown by the workers:
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:33901 -> tcp://127.0.0.1:38821
Traceback (most recent call last):
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 248, in write
future = stream.write(frame)
File "/home/iouser/.local/lib/python3.7/site-packages/tornado/iostream.py", line 546, in write
self._check_closed()
File "/home/iouser/.local/lib/python3.7/site-packages/tornado/iostream.py", line 1035, in _check_closed
raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/worker.py", line 1248, in get_data
compressed = await comm.write(msg, serializers=serializers)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 255, in write
convert_stream_closed_error(self, e)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: BrokenPipeError: [Errno 32] Broken pipe
I am not able to locally reproduce this error or find a minimum reproducible example, as the occurrence of this error is abrupt.
Is this the right way to use Dask LocalCluster in a modular python program?
EDIT
I have observed that these errors come up when the LocalCluster is created with a relatively high number of threads and processes. I am doing computations which uses NumPy and Pandas and this is not a good practice as described here.
At times, when the LocalCluster is created using 4 workers and 16 processes, no error gets thrown. When the LocalCluster is created using 8 workers and 40 processes, the error I described above gets thrown.
As far as I understand, dask randomly selects this combination (is this an issue with dask?), as I tested on the same AWS Batch instance (with 8 cores (16 vCPUs)).
The issue does not pop up when I forcefully create the cluster with only threads.
Eg:
cluster = LocalCluster(processes=False)
with Client(cluster) as client:
client.submit(...)
...
But, creating the LocalCluster using only threads slows down the execution by 2-3 times.
So, is the solution to the problem, finding the right number of processes/threads suitable to the program?

It is more common to create a Dask Client once, and then run many workloads on it.
with Client() as client:
stage_one(client)
stage_two(client)
That being said, what you're doing should be fine. If you're able to reproduce the error with a minimal example, that would be useful (but no expectations).

Related

Is multiprocessing.Pool not allowed in Airflow task? - AssertionError: daemonic processes are not allowed to have children

Our airflow project has a task that queries from BigQuery and uses Pool to dump in parallel to local JSON files:
def dump_in_parallel(table_name):
base_query = f"select * from models.{table_name}"
all_conf_ids = range(1,10)
n_jobs = 4
with Pool(n_jobs) as p:
p.map(partial(dump_conf_id, base_query = base_query), all_conf_ids)
with open("/tmp/final_output.json", "wb") as f:
filenames = [f'/tmp/output_file_{i}.json' for i in all_conf_ids]
This task was working fine for us in airflow v1.10, but is no longer working in v2.1+. Section 2.1 here - https://blog.mbedded.ninja/programming/languages/python/python-multiprocessing/ - mentions "If you try and create a Pool from within a child worker that was already created with a Pool, you will run into the error: daemonic processes are not allowed to have children"
Here is the full Airflow error:
[2021-08-22 02:11:53,064] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 150, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 161, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/plugins/tasks/bigquery.py", line 249, in dump_in_parallel
with Pool(n_jobs) as p:
File "/usr/local/lib/python3.7/multiprocessing/context.py", line 119, in Pool
context=self.get_context())
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
self._repopulate_pool()
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
w.start()
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 110, in start
'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
If it matters, we run airflow using the LocalExecutor. Any idea why this task that uses Pool would have been working in airflow v1.10 but no longer in airflow 2.1?

Airflow 2 uses different processing model under the hood to speed up processing, yet to maintain process-based isolation between running tasks.
That's why it uses forking and multiprocessing under the hook to run Tasks, but this also means that if you are using multiprocessing, you will hit the limits of Python multiprocessing that does not allow to chain multi-processing.
I am not 100% sure if it will work but you might try to set execute_tasks_new_python_interpreter configuration to True. https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#execute-tasks-new-python-interpreter . This setting will cause airflow to start a new Python interpreter when running task instead of forking/using multiprocessing (though I am not 100% sure of the latter). It will work quite a bit slower (up to a few seconds of overhead) though to run your task as the new Python interpreter will have to reinitialize and import all the airflow code before running your task.
If that does not work, then you can lunch your multiprocessing job using PythonVirtualenvOperator - that one will launch a new Python interpreter to run your python code and you should be able to use multiprocessing.

Replacing the multiprocessing library with billiard library works, per https://github.com/celery/celery/issues/4525. We have no idea why subbing one library in for the other resolves this issue though...

You can switch to joblib python library with lock backend and kill daemonic processes after execution with following instructions.

"Dictionary size changed during iteration" from Pebble ProcessPool

We've some parallel processing code which is built around Pebble, it's been working robustly for quite some time but we seem to have run into some odd edge-case.
Based on the exception trace (and the rock-simple code feeding it) I suspect that it's actually a bug in Pebble but who knows.
The code feeding the process pool is pretty trivial:
pool = ProcessPool(max_workers=10, max_tasks=10)
for path in filepaths:
try:
future = pool.schedule(function=self.analyse_file, args(path), timeout=30)
future.add_done_callback(self.process_result)
exception Exception as e:
print("Exception fired:" + e) # NOT where the exception is firing
pool.close()
pool.join()
So in essence, we schedule a bunch of stuff to run, close out the pool then wait for the pool to complete the scheduled tasks. NOTE: the exception is not being thrown in the schedule loop, it gets fired AFTER we call join().
This is the exception stack trace:
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 150, in task_scheduler_loop
pool_manager.schedule(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 198, in schedule
self.worker_manager.dispatch(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 327, in dispatch
self.pool_channel.send(WorkerTask(task.id, task.payload))
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/channel.py", line 66, in send
return self.writer.send(obj)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
RuntimeError: dictionary changed size during iteration
I think it's got to be some weird race condition, as the code will work flawlessly on some data sets but fail at what appears to be a random point on another dataset.
We were using pebble 4.3.1 when we first ran into the issue (same version we'd had since the beginning), tried upgrading to 4.5.0, no change.
Has anybody run into similar issues with Pebble in the past? If so what was your fix?

Django: running a few tasks in parallel

I have a web application build out of Django version 2.0.1
A user uploads a file, and based on the content, there are tasks which are executed in a serial fashion. After execution the results are shown to the user. Some of the tasks are independent of each other.
I want to execute the independent tasks in parallel. I tried using multiprocessing within views.py but there are some errors thrown when the processes are spawned. These tasks analyse some information and write to a file. The files are then combined to show the results to the user.
These tasks cannot be done asynchronous as the results produced needs to be shown to the user waiting. So I have dropped the idea of using Celery as recommended in other discussions.
Anyone's suggestions would be helpful.
Thanks
Error got
This was the error we gotTraceback (most recent call last):
C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\multiprocessing\spawn.py", line 106, in spawn_main
exitcode = _main(fd)
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\multiprocessing\spawn.py", line 116, in _main
self = pickle.load(from_parent)
File "G:\work\gitrepo\suprath-github\smartdata\ssd\FinalPlots\uploads\core\views.py", line 6, in
from uploads.core.models import Document
File "G:\work\gitrepo\suprath-github\smartdata\ssd\FinalPlots\uploads\core\models.py", line 7, in
class Document(models.Model):
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\site-packages\django\db\models\base.py", line 100, in new
app_config = apps.get_containing_app_config(module)
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\site-packages\django\apps\registry.py", line 244, in get_containing_app_config
self.check_apps_ready()
File "C:\Users\idea\AppData\Local\Enthought\Canopy\edm\envs\python\lib\site-packages\django\apps\registry.py", line 127, in check_apps_ready
raise AppRegistryNotReady("Apps aren't loaded yet.")
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
Traceback (most recent call last):

These tasks cannot be done asynchronous as the results produced needs to be shown to the user waiting
That doesn't mean you can't use an async queue (celery or other). We have a very similar use case and do use celery to run the tasks. The tasks (part parallel, part serial) store their progress in redis, and the frontend polls to get the current state and display progress to the user, then when the whole process is done (either successfuly or not) we display the result (or errors).

I agree with the solution provided by #bruno desthuillieres, however, you can implement some socket solution to reach back to the user.
Since polling from user may have huge performance impacts, the socket solution will ideal for this case.

Error in Python multiprocessing process

I am trying a write a python code having multiple processes whose structure and flow is something like this:
import multiprocessing
import ctypes
import time
import errno
m=multiprocessing.Manager()
mylist=m.list()
var1=m.Value('i',0)
var2=m.Value('i',1)
var3=m.Value('i',2)
var4=m.Value(ctypes.c_char_p,"a")
var5=m.Value(ctypes.c_char_p,"b")
var6=3
var7=4
var8=5
var9=6
var10=7
def func(var1,var2,var4,var5,mylist):
i=0
try:
if var1.value==0:
print var2.value,var4.value,var5.value
mylist.append(time.time())
elif var1.value==1:
i=i+2
print var2.value+2,var4.value,var5.value
mylist.append(time.time())
except IOError as e:
if e.errno==errno.EPIPE:
var3.value=var3.value+1
print "Error"
def work():
for i in range(var3.value):
print i,var6,var7,va8,var9,var10
p=multiprocessing.Process(target=func,args=(var1,var2,var4,var5,mylist))
p.start()
work()
When I run this code, sometimes it works perfectly, sometimes it does not run for exact amount of loop counts and sometimes I get following error:
0
1
Process Process-2:
Traceback (most recent call last):
File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "dummy.py", line 19, in func
if var1.value==0:
File "/usr/lib64/python2.6/multiprocessing/managers.py", line 1005, in get
return self._callmethod('get')
File "/usr/lib64/python2.6/multiprocessing/managers.py", line 722, in _callmethod
self._connect()
File "/usr/lib64/python2.6/multiprocessing/managers.py", line 709, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib64/python2.6/multiprocessing/connection.py", line 149, in Client
answer_challenge(c, authkey)
File "/usr/lib64/python2.6/multiprocessing/connection.py", line 383, in answer_challenge
message = connection.recv_bytes(256) # reject large message
EOFError
What does this error mean? What wrong am I doing here? What this error indicates? Kindly guide me to the correct path. I am using CentOS 6.5

Working with shared variables in multiprocessing is tricky. Because of the python Global Interpreter Lock (GIL), multiprocessing is not directly possible in Python. When you use the multiprocessing module, you can launch several task on different process, BUT you can't share the memory.
In you case, you need this so you try to use shared memory. But what happens here is that you have several processes trying to read the same memory at the same time. To avoid memory corruption, a process lock the memory address it is currently reading, forbidding other processes to access it until it finishes reading.
Here you have 3 processes trying to evaluate var1.value in the first if loop of your func : the first process read the value, and the other are blocked, raising an error.
To avoid this mechanism, you should always manage the Lock of your shared variables yourself.
You can try with syntax:
var1=multiprocessing.Value('i',0) # create shared variable
var1.acquire() # get the lock : it will wait until lock is available
var1.value # read the value
var1.release() # release the lock
External documentation :
Locks : https://docs.python.org/2/librar/multiprocessing.html#synchronization-between-processes
GIL : https://docs.python.org/2/glossary.html#term-global-interpreter-lock

pytables crash with threads

The following code shows a problem in the interaction between pytables and threading. I'm creating an HDF file and reading it with 100 concurrent threads:
import threading
import pandas as pd
from pandas.io.pytables import HDFStore, get_store
filename='test.hdf'
with get_store(filename,mode='w') as store:
store['x'] = pd.DataFrame({'y': range(10000)})
def process(i,filename):
# print 'start', i
with get_store(filename,mode='r') as store:
df = store['x']
# print 'end', i
return df['y'].max
threads = []
for i in range(100):
t = threading.Thread(target=process, args = (i,filename,))
t.daemon = True
t.start()
threads.append(t)
for t in threads:
t.join()
The program usually executes cleanly. But now and then I get exceptions like this:
Exception in thread Thread-27:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "crash.py", line 13, in process
with get_store(filename,mode='r') as store:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 259, in get_store
store = HDFStore(path, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 398, in __init__
self.open(mode=mode, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 528, in open
self._handle = tables.openFile(self._path, self._mode, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
return obj(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 298, in open_file
for filehandle in _open_files.get_handlers_by_name(filename):
RuntimeError: Set changed size during iteration
or
[...]
File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
return obj(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 299, in open_file
omode = filehandle.mode
AttributeError: 'File' object has no attribute 'mode'
While reducing the code I got very different error messages, some of them indicating memory corruption.
Here are my library versions:
>>> pd.__version__
'0.13.1'
>>> tables.__version__
'3.1.0'
I already have had an error with threads which occured in writing files and I solved it by recompiling hdf5 with options: --enable-threadsafe --with-pthread
Can anyone reproduce the problem? How to solve it?

Anthony already pointed out that hdf5 (PyTables is basically a wrapper around the hdf5 C library) is not thread-safe. If you want to access an hdf5 file from a web application, you have basically two options:
Use a dedicated process that handles all the hdf5 I/O. Processes/threads of the web application must communicate with this process through, e.g., Unix Domain Sockets. The downside of this approach — obviously — is that it scales very badly. If one web request is accessing the hdf5 file, all other requests must wait.
Implement a read-write locking mechanism that allows concurrent reading, but uses an exclusive lock for writing. Cf. http://en.wikipedia.org/wiki/Readers-writers_problem.
Note that with a mod_wsgi application — depending on the configuration — you have to deal with threads and processes!
I am also currently struggling with using hdf5 as a database backend for a web application. I think the 2nd approach above provides a decent solution. But still, hdf5 is not a database system. If you want a real array database server with a Python interface, have a look at http://www.scidb.org. It is not nearly as light-weight as an hdf5-based solution, though.

One bit that has not been mentioned yet, recompile HDF5 to be thread-safe using:
--enable-threadsafe --with-pthread=DIR
https://support.hdfgroup.org/HDF5/faq/threadsafe.html
I had some hard-to-find bugs in my keras code, which uses HDF5, and this was what solved it.

PyTables is not fully thread safe. Use multiprocess pools instead.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.