pytables crash with threads - python

The following code shows a problem in the interaction between pytables and threading. I'm creating an HDF file and reading it with 100 concurrent threads:
import threading
import pandas as pd
from pandas.io.pytables import HDFStore, get_store

filename = 'test.hdf'
with get_store(filename, mode='w') as store:
    store['x'] = pd.DataFrame({'y': range(10000)})

def process(i, filename):
    # print 'start', i
    with get_store(filename, mode='r') as store:
        df = store['x']
    # print 'end', i
    return df['y'].max()

threads = []
for i in range(100):
    t = threading.Thread(target=process, args=(i, filename))
    t.daemon = True
    t.start()
    threads.append(t)

for t in threads:
    t.join()
The program usually executes cleanly. But now and then I get exceptions like this:
Exception in thread Thread-27:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "crash.py", line 13, in process
with get_store(filename,mode='r') as store:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 259, in get_store
store = HDFStore(path, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 398, in __init__
self.open(mode=mode, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 528, in open
self._handle = tables.openFile(self._path, self._mode, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
return obj(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 298, in open_file
for filehandle in _open_files.get_handlers_by_name(filename):
RuntimeError: Set changed size during iteration
or
[...]
File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
return obj(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 299, in open_file
omode = filehandle.mode
AttributeError: 'File' object has no attribute 'mode'
While reducing the code I got very different error messages, some of them indicating memory corruption.
Here are my library versions:
>>> pd.__version__
'0.13.1'
>>> tables.__version__
'3.1.0'
I have already had an error with threads that occurred when writing files, and I solved it by recompiling hdf5 with the options --enable-threadsafe --with-pthread.
Can anyone reproduce the problem? How can it be solved?

Anthony already pointed out that hdf5 (PyTables is basically a wrapper around the hdf5 C library) is not thread-safe. If you want to access an hdf5 file from a web application, you have basically two options:
Use a dedicated process that handles all the hdf5 I/O. Processes/threads of the web application must communicate with this process through, e.g., Unix Domain Sockets. The downside of this approach — obviously — is that it scales very badly. If one web request is accessing the hdf5 file, all other requests must wait.
Implement a read-write locking mechanism that allows concurrent reading, but uses an exclusive lock for writing. Cf. http://en.wikipedia.org/wiki/Readers-writers_problem.
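For the second approach, here is a minimal in-process sketch, assuming all HDF5 access happens in the threads of a single process (a multi-process deployment, e.g. mod_wsgi with several worker processes, would need a file-based lock instead); the class name and methods are illustrative, not from any particular library:
import threading
from pandas.io.pytables import get_store

class ReadWriteLock(object):
    """Allow many concurrent readers, but give a writer exclusive access."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        # Hold the condition's lock for the whole write so no new readers
        # can enter; wait until the current readers have drained.
        # Note: a simple scheme like this can starve writers if readers keep arriving.
        self._cond.acquire()
        while self._readers > 0:
            self._cond.wait()

    def release_write(self):
        self._cond.release()

rw_lock = ReadWriteLock()

def read_frame(store_path):
    rw_lock.acquire_read()
    try:
        with get_store(store_path, mode='r') as store:
            return store['x']
    finally:
        rw_lock.release_read()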
Note that with a mod_wsgi application — depending on the configuration — you have to deal with threads and processes!
I am also currently struggling with using hdf5 as a database backend for a web application. I think the 2nd approach above provides a decent solution. But still, hdf5 is not a database system. If you want a real array database server with a Python interface, have a look at http://www.scidb.org. It is not nearly as light-weight as an hdf5-based solution, though.

One thing that has not been mentioned yet: recompile HDF5 to be thread-safe using:
--enable-threadsafe --with-pthread=DIR
https://support.hdfgroup.org/HDF5/faq/threadsafe.html
I had some hard-to-find bugs in my keras code, which uses HDF5, and this was what solved it.

PyTables is not fully thread-safe. Use multiprocessing pools instead.
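For illustration, a minimal sketch of the question's example rewritten with a multiprocessing pool (assuming the same pandas 0.13 get_store API as above); each worker process opens the file with its own handle, so no PyTables state is shared between threads:
import multiprocessing
import pandas as pd
from pandas.io.pytables import get_store

filename = 'test.hdf'

def process(i):
    # Each call runs in its own process with its own HDF5 file handle.
    with get_store(filename, mode='r') as store:
        df = store['x']
    return df['y'].max()

if __name__ == '__main__':
    with get_store(filename, mode='w') as store:
        store['x'] = pd.DataFrame({'y': range(10000)})
    pool = multiprocessing.Pool(processes=4)
    print(pool.map(process, range(100)))
    pool.close()
    pool.join()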

Related

python joblib returning `TypeError: cannot pickle 'weakref' object` with the same code, but different input data

I'm trying to parallelise code which converts strings to a third-party package object using a function defined in that library. However, joblib fails depending on the input data that I provide. Are the return types of the function important when using joblib?
To reproduce the issue:
First install third party libraries with:
pip install joblib music21
and download the data file test_input.abc (it's 4kb of text).
This code will run fine as a script:
from typing import List

import music21
from joblib import Parallel, delayed

def convert_string(string: str, format: str = "abc") -> music21.stream.Score:
    return music21.converter.parse(string, format=format)

def convert_list_of_strings(
    string_list,
    n_jobs=-1,
    prefer=None
) -> List[music21.stream.Score]:
    return Parallel(n_jobs=n_jobs, prefer=prefer)(
        delayed(convert_string)(string) for string in string_list
    )

if __name__ == "__main__":
    string_list = ['T:tune\nM:3/4\nL:1/8\nK:C\nab cd ef|GA BC DE' for _ in range(1000)]
    output = convert_list_of_strings(string_list)
    print(output)
i.e. it returns a list of music21.stream.Score objects.
However, if you change the main call to read the attached file, i.e.:
if __name__ == "__main__":
    filepath = "test_input.abc"
    tune_sep = "\n\n"
    with open(filepath, "r") as file_object:
        string_list = file_object.read().strip().split(tune_sep)
    output = convert_list_of_strings(string_list)
    print(output)
This will return the following error:
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 356, in _sendback_result
result_queue.put(_ResultItem(work_id, result=result,
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 241, in put
obj = dumps(obj, reducers=self._reducers)
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "/path/to/venv/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'weakref' object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "joblib_test.py", line 50, in <module>
output = convert_list_of_strings(string_list)
File "joblib_test.py", line 39, in convert_list_of_strings
return Parallel(n_jobs=n_jobs, prefer=prefer)(
File "/path/to/venv/lib/python3.8/site-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/path/to/venv/lib/python3.8/site-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/path/to/venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/path/to/venv/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/path/to/venv/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
TypeError: cannot pickle 'weakref' object
What I've tried to fix
Checked that the code runs fine when run without parallelism i.e. set n_jobs=1
Changed joblib's backend to use threads i.e. set prefer="threads" - this fixes things, i.e. it will not error - but I don't want to use threads!
Tried serializing the output of the function, i.e. changing convert_string to:
def convert_string(string: str, format: str = "abc") -> str:
    return music21.converter.freezeStr(music21.converter.parse(string, format=format))
this also means the code will run...but now I have thousands of objects I need to deserialize!
Checked that the input data type is the same whether reading from the file or using the first method (i.e. it's a list of strings in both cases)
Checked that the output data type is the same whether reading from the file or using the first method (i.e. it's a list of music21.stream.Score in both cases)
Faffed about with #wrap_non_pickleable_objects
So I'm guessing that the content of the music21.stream.Score objects in the output is causing the issue?
Change music21.sites.WEAKREF_ACTIVE = False (you might need to edit music21/sites.py directly) and music21 won't use any weak references. They'll probably disappear in v8 anyhow (or maybe even sooner since they're mostly an implementation detail). They were needed for running music21 in the Pre-Python2.6 circular reference counting era, but they're not really necessary anymore.
However, you're not going to get much of a speedup in your code, because serializing and deserializing the Stream to pass it across the multiprocessing worker-core to controller-core boundary will usually take as long as parsing the file itself, if not longer. I can't find where I wrote it, but at some point I wrote a guide to running music21 in parallel that suggests doing all your stream parsing in the worker core and only passing back small data structures (counts of numbers of notes, etc.), not the whole score; a sketch of that pattern follows.
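For example, a hedged sketch of that pattern based on the question's code; the note count is a hypothetical stand-in for whatever small summary you actually need:
from typing import List

import music21
from joblib import Parallel, delayed

def count_notes(string: str, format: str = "abc") -> int:
    score = music21.converter.parse(string, format=format)
    # Return a plain int instead of the Score, so nothing weakref-laden
    # has to cross the worker -> controller boundary.
    return len(score.recurse().notes)

def count_notes_in_strings(string_list, n_jobs: int = -1) -> List[int]:
    return Parallel(n_jobs=n_jobs)(
        delayed(count_notes)(s) for s in string_list
    )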
Oh, for some of these things, music21's common.parallel library (which uses joblib) will help make common tasks easier:
https://web.mit.edu/music21/doc/moduleReference/moduleCommonParallel.html

Is multiprocessing.Pool not allowed in Airflow task? - AssertionError: daemonic processes are not allowed to have children

Our airflow project has a task that queries from BigQuery and uses Pool to dump in parallel to local JSON files:
def dump_in_parallel(table_name):
    base_query = f"select * from models.{table_name}"
    all_conf_ids = range(1, 10)
    n_jobs = 4
    with Pool(n_jobs) as p:
        p.map(partial(dump_conf_id, base_query=base_query), all_conf_ids)
    with open("/tmp/final_output.json", "wb") as f:
        filenames = [f'/tmp/output_file_{i}.json' for i in all_conf_ids]
This task was working fine for us in airflow v1.10, but is no longer working in v2.1+. Section 2.1 here - https://blog.mbedded.ninja/programming/languages/python/python-multiprocessing/ - mentions "If you try and create a Pool from within a child worker that was already created with a Pool, you will run into the error: daemonic processes are not allowed to have children"
Here is the full Airflow error:
[2021-08-22 02:11:53,064] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 150, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 161, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/plugins/tasks/bigquery.py", line 249, in dump_in_parallel
with Pool(n_jobs) as p:
File "/usr/local/lib/python3.7/multiprocessing/context.py", line 119, in Pool
context=self.get_context())
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
self._repopulate_pool()
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
w.start()
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 110, in start
'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
If it matters, we run airflow using the LocalExecutor. Any idea why this task that uses Pool would have been working in airflow v1.10 but no longer in airflow 2.1?
Airflow 2 uses a different processing model under the hood to speed up processing, while still maintaining process-based isolation between running tasks.
That's why it uses forking and multiprocessing under the hood to run tasks, but this also means that if you are using multiprocessing yourself, you will hit a limit of Python multiprocessing: daemonic processes are not allowed to spawn their own children, so multiprocessing cannot be nested.
I am not 100% sure it will work, but you might try setting the execute_tasks_new_python_interpreter configuration option to True. https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#execute-tasks-new-python-interpreter . This setting will cause airflow to start a new Python interpreter when running a task instead of forking/using multiprocessing (though I am not 100% sure of the latter). Running your task will be quite a bit slower (up to a few seconds of overhead), though, because the new Python interpreter will have to reinitialize and import all the airflow code before running your task.
If that does not work, then you can launch your multiprocessing job using the PythonVirtualenvOperator - that operator launches a new Python interpreter to run your Python code, so you should be able to use multiprocessing there; see the sketch below.
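A hedged sketch of that route (the import path is the Airflow 2 one; the DAG parameters and the callable body are placeholders standing in for the question's Pool-based dump):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def dump_in_parallel_isolated():
    # Imports must live inside the callable: it is serialized and executed in a
    # fresh, non-daemonic interpreter, where multiprocessing.Pool is allowed.
    from multiprocessing import Pool
    with Pool(4) as p:
        print(p.map(len, ["a", "bb", "ccc"]))  # stand-in for the real dump

with DAG(dag_id="dump_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    dump_task = PythonVirtualenvOperator(
        task_id="dump_in_parallel",
        python_callable=dump_in_parallel_isolated,
        system_site_packages=True,
        requirements=[],
    )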
Replacing the multiprocessing library with the billiard library works, per https://github.com/celery/celery/issues/4525. We have no idea why substituting one library for the other resolves this issue, though...
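If it helps, a hedged sketch of that swap: billiard mirrors the multiprocessing Pool API, so only the import changes; the dump_conf_id stub below stands in for the question's own helper.
from functools import partial
from billiard import Pool  # swapped in for: from multiprocessing import Pool

def dump_conf_id(conf_id, base_query):
    # stand-in for the question's helper that dumps one conf_id to JSON
    print(base_query, conf_id)

def dump_in_parallel(table_name):
    base_query = f"select * from models.{table_name}"
    all_conf_ids = range(1, 10)
    with Pool(4) as p:
        p.map(partial(dump_conf_id, base_query=base_query), all_conf_ids)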
You can also switch to the joblib Python library with the loky backend and kill the daemonic processes after execution, following its documentation.

Using Dask LocalCluster() within a modular python codebase

I am trying to use Dask Distributed's LocalCluster to run code in parallel using all the cores of a single machine.
Consider a sample python data pipeline, with the folder structure below.
sample_dask_program
├── main.py
├── parallel_process_1.py
├── parallel_process_2.py
├── process_1.py
├── process_2.py
└── process_3.py
main.py is the entry point, which executes the whole pipeline sequentially.
Eg:
def run_pipeline():
    stage_one_run_util()
    stage_two_run_util()
    ...
    stage_six_run_util()

if __name__ == '__main__':
    ...
    run_pipeline()
parallel_process_1.py and parallel_process_2.py are modules which create a Client() and use futures to achieve parallelism.
with Client() as client:
    # list to store futures after they are submitted
    futures = []
    for item in items:
        future = client.submit(
            ...
        )
        futures.append(future)
    results = client.gather(futures)
process_1.py, process_2.py and process_3.py are modules which do simple computation that need not be run in parallel using all the CPU cores.
Traceback:
File "/sm/src/calculation/parallel.py", line 140, in convert_qty_to_float
results = client.gather(futures)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/client.py", line 1894, in gather
asynchronous=asynchronous,
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/client.py", line 778, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/utils.py", line 348, in sync
raise exc.with_traceback(tb)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/utils.py", line 332, in f
result[0] = yield future
File "/home/iouser/.local/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
concurrent.futures._base.CancelledError
This is the error thrown by the workers:
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:33901 -> tcp://127.0.0.1:38821
Traceback (most recent call last):
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 248, in write
future = stream.write(frame)
File "/home/iouser/.local/lib/python3.7/site-packages/tornado/iostream.py", line 546, in write
self._check_closed()
File "/home/iouser/.local/lib/python3.7/site-packages/tornado/iostream.py", line 1035, in _check_closed
raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/worker.py", line 1248, in get_data
compressed = await comm.write(msg, serializers=serializers)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 255, in write
convert_stream_closed_error(self, e)
File "/home/iouser/.local/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: BrokenPipeError: [Errno 32] Broken pipe
I am not able to reproduce this error locally or find a minimal reproducible example, as the occurrence of this error is abrupt.
Is this the right way to use Dask LocalCluster in a modular python program?
EDIT
I have observed that these errors come up when the LocalCluster is created with a relatively high number of threads and processes. I am doing computations which use NumPy and pandas, and this is not a good practice, as described here.
At times, when the LocalCluster is created using 4 workers and 16 processes, no error gets thrown. When the LocalCluster is created using 8 workers and 40 processes, the error I described above gets thrown.
As far as I understand, dask randomly selects this combination (is this an issue with dask?), as I tested on the same AWS Batch instance (with 8 cores (16 vCPUs)).
The issue does not pop up when I forcefully create the cluster with only threads.
Eg:
cluster = LocalCluster(processes=False)
with Client(cluster) as client:
    client.submit(...)
    ...
But, creating the LocalCluster using only threads slows down the execution by 2-3 times.
So, is the solution to the problem, finding the right number of processes/threads suitable to the program?
It is more common to create a Dask Client once, and then run many workloads on it.
with Client() as client:
    stage_one(client)
    stage_two(client)
That being said, what you're doing should be fine. If you're able to reproduce the error with a minimal example, that would be useful (but no expectations).
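For example, a hedged sketch combining both points from this thread: size the LocalCluster explicitly (so dask doesn't pick the worker/thread split for you) and reuse a single Client across the pipeline stages; the stage functions are placeholders for the real pipeline utilities.
from dask.distributed import Client, LocalCluster

def stage_one(client):
    # placeholder stage: submit work through the shared client
    return client.gather([client.submit(pow, i, 2) for i in range(10)])

def stage_two(client):
    return client.gather([client.submit(sum, [i, i]) for i in range(10)])

if __name__ == '__main__':
    # e.g. 4 worker processes x 2 threads each on an 8-core machine,
    # instead of letting dask choose a combination.
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    with Client(cluster) as client:
        print(stage_one(client))
        print(stage_two(client))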

Error in Python multiprocessing process

I am trying to write Python code with multiple processes whose structure and flow is something like this:
import multiprocessing
import ctypes
import time
import errno

m = multiprocessing.Manager()
mylist = m.list()
var1 = m.Value('i', 0)
var2 = m.Value('i', 1)
var3 = m.Value('i', 2)
var4 = m.Value(ctypes.c_char_p, "a")
var5 = m.Value(ctypes.c_char_p, "b")
var6 = 3
var7 = 4
var8 = 5
var9 = 6
var10 = 7

def func(var1, var2, var4, var5, mylist):
    i = 0
    try:
        if var1.value == 0:
            print var2.value, var4.value, var5.value
            mylist.append(time.time())
        elif var1.value == 1:
            i = i + 2
            print var2.value + 2, var4.value, var5.value
            mylist.append(time.time())
    except IOError as e:
        if e.errno == errno.EPIPE:
            var3.value = var3.value + 1
            print "Error"

def work():
    for i in range(var3.value):
        print i, var6, var7, var8, var9, var10

p = multiprocessing.Process(target=func, args=(var1, var2, var4, var5, mylist))
p.start()
work()
When I run this code, sometimes it works perfectly, sometimes it does not run for the exact number of loop iterations, and sometimes I get the following error:
0
1
Process Process-2:
Traceback (most recent call last):
File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "dummy.py", line 19, in func
if var1.value==0:
File "/usr/lib64/python2.6/multiprocessing/managers.py", line 1005, in get
return self._callmethod('get')
File "/usr/lib64/python2.6/multiprocessing/managers.py", line 722, in _callmethod
self._connect()
File "/usr/lib64/python2.6/multiprocessing/managers.py", line 709, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib64/python2.6/multiprocessing/connection.py", line 149, in Client
answer_challenge(c, authkey)
File "/usr/lib64/python2.6/multiprocessing/connection.py", line 383, in answer_challenge
message = connection.recv_bytes(256) # reject large message
EOFError
What does this error mean? What am I doing wrong here? What does this error indicate? Kindly guide me to the correct path. I am using CentOS 6.5.
Working with shared variables in multiprocessing is tricky. Because of the Python Global Interpreter Lock (GIL), threads cannot run Python code truly in parallel, so the multiprocessing module is used instead: it lets you launch several tasks in different processes, BUT those processes do not share memory by default.
In your case you need shared state, so you use shared memory. But what happens here is that you have several processes trying to read the same memory at the same time. To avoid memory corruption, a process locks the memory address it is currently reading, forbidding other processes from accessing it until it finishes reading.
Here you have processes trying to evaluate var1.value in the first if block of your func: the first process reads the value, and the others are blocked, raising an error.
To avoid this, you should always manage the lock of your shared variables yourself.
You can try this syntax:
var1=multiprocessing.Value('i',0) # create shared variable
var1.acquire() # get the lock : it will wait until lock is available
var1.value # read the value
var1.release() # release the lock
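Equivalently, with a non-manager multiprocessing.Value you can use its attached lock as a context manager (a small sketch, not taken from the question's code):
var1 = multiprocessing.Value('i', 0)   # shared variable with an attached lock
with var1.get_lock():                  # acquired on entry, released on exit
    var1.value += 1                    # read/modify/write while holding the lock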
External documentation:
Locks: https://docs.python.org/2/library/multiprocessing.html#synchronization-between-processes
GIL: https://docs.python.org/2/glossary.html#term-global-interpreter-lock

passing file instance as an argument to the celery task raises "ValueError: I/O operation on closed file"

I need to pass a file as an argument to a celery task, but the passed file somehow arrives there closed. It only happens when I execute the task asynchronously.
Is this an expected behavior?
views:
from engine.tasks import s3_upload_handler
def myfunc():
    f = open('/app/uploads/pic.jpg', 'rb')
    s3_file_handler.apply_async(kwargs={"uploaded_file": f, "file_name": "test.jpg"})
tasks:
def s3_upload_handler(uploaded_file, file_name):
    ...
    # some code for uploading to s3
traceback:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 437, in __protected_call__
return self.run(*args, **kwargs)
File "/app/photohosting/engine/tasks.py", line 34, in s3_upload_handler
key.set_contents_from_file(uploaded_file)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1217, in set_contents_from_file
spos = fp.tell()
ValueError: I/O operation on closed file
flower logs:
kwargs {
'file_name': 'test.jpg',
'uploaded_file': <closed file '<uninitialized file>',
mode '<uninitialized file>' at 0x7f6ab9e75e40>
}
Yes, of course the file arrives there closed. Asynchronous celery tasks run in a completely separate process (moreover, they can even run on a different machine), and there is no way to pass an open file to them.
You should close the file in the process from which you call the task, then pass its name (and maybe the position in the file, if you need it) to the task, and reopen it in the task.
Another way of doing this would be to open the file, read it into a binary blob, and transfer that over the wire. Of course, if the file is really large then what @Vasily says is better, but it won't work if the worker is running on a different machine (unless your file is on shared storage).
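A minimal sketch of the first suggestion, reusing the question's names: pass the path (a plain string, which serializes fine) instead of the open file object, and reopen the file inside the task. The uploaded_path parameter is illustrative, and the S3 upload itself stays elided, as in the question.
# views
def myfunc():
    s3_upload_handler.apply_async(
        kwargs={"uploaded_path": "/app/uploads/pic.jpg", "file_name": "test.jpg"}
    )

# tasks (the function keeps whatever celery task decorator it already has)
def s3_upload_handler(uploaded_path, file_name):
    with open(uploaded_path, 'rb') as uploaded_file:
        ...  # some code for uploading uploaded_file to s3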
