Python Pickle throwing FileNotFoundError - python

I have a situation in Python where, inside a runnable object that runs alongside other processes, the following happens. If the code is simply:
f = open(filename, "rb")
f.close()
there is no error. But when the code changes to the following, introducing pickle in the middle, it throws a FileNotFoundError:
f = open(filename, "rb")
object = pickle.load(f)
f.close()
I don't understand why pickle would throw such an error if the file exists. The complete traceback is:
task = pickle.load(f)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 852, in RebuildProxy
return func(token, serializer, incref=incref, **kwds)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 706, in __init__
self._incref()
File "/usr/lib/python3.4/multiprocessing/managers.py", line 756, in _incref
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python3.4/multiprocessing/connection.py", line 495, in Client
c = SocketClient(address)
File "/usr/lib/python3.4/multiprocessing/connection.py", line 624, in SocketClient
s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory

While I'd say that @TesselatingHeckler provided the answer, this comment doesn't really fit in the comments section… so it's an extension of that answer.
You can't pickle multiprocessing Queue and Pipe objects, and indeed, some pickled objects only fail on load. That is what the traceback above shows: the pickle contains a multiprocessing manager proxy, and rebuilding the proxy on load tries to reconnect to the manager's socket, which no longer exists, hence the FileNotFoundError.
I have built a fork of multiprocessing that allows most objects to be pickled. Primarily, it replaces pickle with a much more robust serializer (dill). The fork is available as part of the pathos package (on GitHub). I've tried to serialize Pipes before, and it works; it also works on a Python socket and a Python Queue. However, it apparently still doesn't work on a multiprocessing Queue.
>>> from processing import Pipe
>>> p = Pipe()
>>>
>>> import dill
>>> dill.loads(dill.dumps(p))
(Connection(handle=12), Connection(handle=14))
>>>
>>> from socket import socket
>>> s = socket()
>>> from Queue import Queue as que
>>> w = que()
>>> dill.loads(dill.dumps(s))
<socket._socketobject object at 0x10dae18a0>
>>> dill.loads(dill.dumps(w))
<Queue.Queue instance at 0x10db49f38>
>>>
>>> from processing import Queue
>>> q = Queue()
>>> dill.loads(dill.dumps(q))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.2.dev-py2.7.egg/dill/dill.py", line 180, in dumps
dump(obj, file, protocol, byref, file_mode, safeio)
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.2.dev-py2.7.egg/dill/dill.py", line 173, in dump
pik.dump(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/Users/mmckerns/lib/python2.7/site-packages/processing/queue.py", line 62, in __getstate__
assertSpawning(self)
File "/Users/mmckerns/lib/python2.7/site-packages/processing/forking.py", line 24, in assertSpawning
'processes through inheritance' % type(self).__name__)
RuntimeError: Queue objects should only be shared between processes through inheritance
It seems that the __getstate__ method for a multiprocessing Queue is hardwired to throw an error. If you had this buried inside a class, it might not be triggered until the class instance is being reconstituted from the pickle.
Trying to pickle a multiprocessing Queue with pickle should also give you the above error.
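For anyone who wants to check that quickly, the same behaviour can be reproduced with the stdlib alone; a minimal check of my own (Python 3 shown, not a snippet from the thread above):
import pickle
import multiprocessing

q = multiprocessing.Queue()
try:
    pickle.dumps(q)
except RuntimeError as e:
    # "Queue objects should only be shared between processes through inheritance"
    print(e)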

Related

Read files with ZipFile using multiprocessing

I'm trying to read raw data from a zipfile. The structure of that file is:
zipfile
    data
        Spectral0.data
        Spectral1.data
        Spectral[...].data
        Spectral300.data
    Header
The goal is to read all Spectral[...].data files into a 2D numpy array (where Spectral0.data would be the first column). The single-threaded approach takes a lot of time, since reading one .data file takes several seconds.
import zipfile
import numpy as np

# dimY, dimX, path and fileCount are defined elsewhere in the script
spectralData = np.zeros(shape=(dimY, dimX), dtype=np.int16)
archive = zipfile.ZipFile(path, 'r')
for file in range(fileCount):
    spectralDataRaw = archive.read('data/Spectral' + str(file) + '.data')
    spectralData[:, file] = np.frombuffer(spectralDataRaw, np.short)
And I thought using multiprocessing could speed up the process, so I read some tutorials on how to set up a multiprocessing procedure. This is what I came up with:
import zipfile
import numpy as np
import multiprocessing
from joblib import Parallel, delayed

archive = zipfile.ZipFile(path, 'r')
numCores = multiprocessing.cpu_count()

def testMult(file):
    spectralDataRaw = archive.read('data/Spectral' + str(file) + '.data')
    return np.frombuffer(spectralDataRaw, np.short)

output = Parallel(n_jobs=numCores)(delayed(testMult)(file) for file in range(fileCount))
output = np.flipud(np.rot90(np.array(output), 1, axes=(0, 2)))
Using this approach I get the following error:
_RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\loky\backend\queues.py", line 153, in _feed
obj_ = dumps(obj, reducers=reducers)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 271, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 264, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\cloudpickle\cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle '_io.BufferedReader' object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-94-c4b007eea8e2>", line 8, in <module>
output = Parallel(n_jobs=numCores)(delayed(testMult)(file)for file in range(fileCount))
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\parallel.py", line 1061, in __call__
self.retrieve()
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\parallel.py", line 940, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\concurrent\futures\_base.py", line 432, in result
return self.__get_result()
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\concurrent\futures\_base.py", line 388, in __get_result
raise self._exception
PicklingError: Could not pickle the task to send it to the workers.
My question is: how do I set up this parallelization correctly? I've read that zipfile is not thread-safe, so I might need a different approach to read the zip content into memory (RAM). I would rather not read the whole zipfile into memory, since the file can be quite large.
I also thought about using from numba import njit, prange, but the problem there is that zipfile is not supported by numba.
What else could I do to make this work?
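One common way around the error in this particular setup (my own sketch, not an answer from the thread; it assumes re-opening the archive once per task is acceptable and that path, fileCount, dimY and dimX are defined as above) is to stop capturing the open ZipFile in the function that joblib has to pickle, and instead open the archive inside each worker:
import zipfile
import numpy as np
import multiprocessing
from joblib import Parallel, delayed

numCores = multiprocessing.cpu_count()

def readSpectral(path, file):
    # The archive is opened inside the worker, so no open file handle
    # (_io.BufferedReader) ever has to be pickled and sent to a subprocess.
    with zipfile.ZipFile(path, 'r') as archive:
        raw = archive.read('data/Spectral' + str(file) + '.data')
    return np.frombuffer(raw, np.short)

output = Parallel(n_jobs=numCores)(
    delayed(readSpectral)(path, file) for file in range(fileCount)
)
spectralData = np.column_stack(output)  # Spectral0.data becomes the first column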

Error using pexpect and multiprocessing? error "TypeError: cannot serialize '_io.TextIOWrapper' object"

I have a Python 3.7 script on a Linux machine where I am trying to run a function in multiple threads, but when I try I receive the following error:
Traceback (most recent call last):
File "./test2.py", line 43, in <module>
pt.ping_scanx()
File "./test2.py", line 39, in ping_scanx
par = Parallel(function=self.pingx, parameter_list=list, thread_limit=10)
File "./test2.py", line 19, in __init__
self._x = self._pool.starmap(function, parameter_list, chunksize=1)
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 431, in _handle_tasks
put(task)
File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
This is the sample code that I am using to demonstrate the issue:
#!/usr/local/bin/python3.7
from multiprocessing import Pool
import pexpect  # Used to run SSH for sessions

class Parallel:
    def __init__(self, function, parameter_list, thread_limit=4):
        # Create new thread to hold our jobs
        self._pool = Pool(processes=thread_limit)
        self._x = self._pool.starmap(function, parameter_list, chunksize=1)

class PingTest():
    def __init__(self):
        self._pex = None

    def connect(self):
        self._pex = pexpect.spawn("ssh snorton@127.0.0.1")

    def pingx(self, target_ip, source_ip):
        print("PING {} {}".format(target_ip, source_ip))

    def ping_scanx(self):
        self.connect()
        list = [['8.8.8.8', '96.53.16.93'],
                ['8.8.8.8', '96.53.16.93']]
        par = Parallel(function=self.pingx, parameter_list=list, thread_limit=10)

pt = PingTest()
pt.ping_scanx()
If I don't include the line with pexpect.spawn, the error doesn't happen. Can someone explain why I am getting the error, and suggest a way to fix it?
With multiprocessing.Pool you're actually calling the function in separate processes, not threads. Processes cannot share Python objects unless they are serialized first and transmitted over inter-process communication channels, which is what multiprocessing.Pool does for you behind the scenes, using pickle as the serializer. Since pexpect.spawn opens a terminal device as a file-like TextIOWrapper object, and you store the returned object in the PingTest instance and then pass the bound method self.pingx to Pool.starmap, the pool will try to serialize self, whose _pex attribute holds the pexpect.spawn object, and that object cannot be serialized because TextIOWrapper does not support serialization.
Since your function is I/O-bound, you should use threading instead, via the multiprocessing.dummy module, for more efficient parallelization and, more importantly in this case, to allow the pexpect.spawn object to be shared across the threads with no need for serialization.
Change:
from multiprocessing import Pool
to:
from multiprocessing.dummy import Pool
Demo: https://repl.it/#blhsing/WiseYoungExperiments
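For completeness, here is a stripped-down sketch of my own (not the linked demo) showing why the thread-based pool sidesteps the problem: a bound method of an object holding an unpicklable file-like attribute can be passed to starmap, because nothing is ever serialized:
from multiprocessing.dummy import Pool  # thread-based, same API as multiprocessing.Pool
import io

class Pinger:
    def __init__(self):
        # Stand-in for the pexpect.spawn object: a file-like object
        # that pickle refuses to serialize.
        self._pex = io.StringIO()

    def pingx(self, target_ip, source_ip):
        return "PING {} {}".format(target_ip, source_ip)

p = Pinger()
pool = Pool(processes=4)
# Threads share memory, so neither p nor its _pex attribute is ever pickled.
results = pool.starmap(p.pingx, [('8.8.8.8', '96.53.16.93')] * 2)
pool.close()
pool.join()
print(results)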

Python not able to pickle ZabbixAPI class instance

I'm not able to pickle a pyzabbix.ZabbixAPI class instance with the following code:
from pyzabbix import ZabbixAPI
from pickle import dumps
api = ZabbixAPI('http://platform.autuitive.com/monitoring/')
print dumps(api)
It results in the following error:
Traceback (most recent call last):
File "zabbix_test.py", line 8, in <module>
print dumps(api)
File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
Pickler(file, protocol).dump(obj)
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/usr/lib/python2.7/copy_reg.py", line 84, in _reduce_ex
dict = getstate()
TypeError: 'ZabbixAPIObjectClass' object is not callable
Well, in the documentation it says:
instances of such classes whose __dict__ or the result of calling
__getstate__() is picklable (see section The pickle protocol for details).
And it would appear this class isn't one. If you really need to do this, then consider writing your own pickling routines.
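If moving such an instance across a pickle boundary is really necessary, one common workaround (my own sketch, with a hypothetical wrapper name; note that any api.login(...) authentication would also have to be repeated after rebuilding) is to pickle only the information needed to recreate the connection, via __getstate__ and __setstate__:
from pyzabbix import ZabbixAPI

class PicklableZabbix(object):
    """Wrapper that rebuilds the ZabbixAPI connection after unpickling."""
    def __init__(self, server):
        self.server = server          # plain string, always picklable
        self.api = ZabbixAPI(server)  # live connection, not picklable

    def __getstate__(self):
        # Drop the live connection; keep only what is needed to rebuild it.
        return {'server': self.server}

    def __setstate__(self, state):
        self.server = state['server']
        self.api = ZabbixAPI(self.server)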

pytables crash with threads

The following code shows a problem in the interaction between pytables and threading. I'm creating an HDF file and reading it with 100 concurrent threads:
import threading
import pandas as pd
from pandas.io.pytables import HDFStore, get_store

filename = 'test.hdf'
with get_store(filename, mode='w') as store:
    store['x'] = pd.DataFrame({'y': range(10000)})

def process(i, filename):
    # print 'start', i
    with get_store(filename, mode='r') as store:
        df = store['x']
    # print 'end', i
    return df['y'].max()

threads = []
for i in range(100):
    t = threading.Thread(target=process, args=(i, filename))
    t.daemon = True
    t.start()
    threads.append(t)

for t in threads:
    t.join()
The program usually executes cleanly. But now and then I get exceptions like this:
Exception in thread Thread-27:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "crash.py", line 13, in process
with get_store(filename,mode='r') as store:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 259, in get_store
store = HDFStore(path, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 398, in __init__
self.open(mode=mode, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 528, in open
self._handle = tables.openFile(self._path, self._mode, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
return obj(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 298, in open_file
for filehandle in _open_files.get_handlers_by_name(filename):
RuntimeError: Set changed size during iteration
or
[...]
File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
return obj(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 299, in open_file
omode = filehandle.mode
AttributeError: 'File' object has no attribute 'mode'
While reducing the code I got very different error messages, some of them indicating memory corruption.
Here are my library versions:
>>> pd.__version__
'0.13.1'
>>> tables.__version__
'3.1.0'
I have already had an error with threads that occurred when writing files, and I solved it by recompiling hdf5 with the options --enable-threadsafe --with-pthread.
Can anyone reproduce the problem? How can it be solved?
Anthony already pointed out that hdf5 (PyTables is basically a wrapper around the hdf5 C library) is not thread-safe. If you want to access an hdf5 file from a web application, you have basically two options:
Use a dedicated process that handles all the hdf5 I/O. Processes/threads of the web application must communicate with this process through, e.g., Unix Domain Sockets. The downside of this approach — obviously — is that it scales very badly. If one web request is accessing the hdf5 file, all other requests must wait.
Implement a read-write locking mechanism that allows concurrent reading but uses an exclusive lock for writing (a minimal sketch follows at the end of this answer). Cf. http://en.wikipedia.org/wiki/Readers-writers_problem.
Note that with a mod_wsgi application — depending on the configuration — you have to deal with threads and processes!
I am also currently struggling with using hdf5 as a database backend for a web application. I think the 2nd approach above provides a decent solution. But still, hdf5 is not a database system. If you want a real array database server with a Python interface, have a look at http://www.scidb.org. It is not nearly as light-weight as an hdf5-based solution, though.
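A rough illustration of the second option (my own sketch of the classic readers-preference solution, not production code; writers can starve, and whether concurrent reads are actually safe still depends on how your HDF5 library was built):
import threading
from pandas.io.pytables import get_store

class ReadWriteLock:
    """Many concurrent readers, one exclusive writer (readers-preference)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        # Hold the condition's lock for the whole write; this also blocks
        # new readers from entering acquire_read.
        self._cond.acquire()
        while self._readers:
            self._cond.wait()

    def release_write(self):
        self._cond.release()

hdf_lock = ReadWriteLock()

def read_frame(filename, key):
    hdf_lock.acquire_read()
    try:
        with get_store(filename, mode='r') as store:
            return store[key]
    finally:
        hdf_lock.release_read()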
One bit that has not been mentioned yet: recompile HDF5 to be thread-safe using:
--enable-threadsafe --with-pthread=DIR
https://support.hdfgroup.org/HDF5/faq/threadsafe.html
I had some hard-to-find bugs in my keras code, which uses HDF5, and this was what solved it.
PyTables is not fully thread-safe. Use multiprocessing pools instead.
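A sketch of that suggestion applied to the example above (my own snippet, assuming the same test.hdf layout and the same old pandas get_store API the question uses): each worker is a separate process with its own HDF5 handles, so nothing is shared between threads.
import multiprocessing
from pandas.io.pytables import get_store

filename = 'test.hdf'

def process(i):
    # Each worker process opens the file independently.
    with get_store(filename, mode='r') as store:
        return store['x']['y'].max()

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=8)
    print(pool.map(process, range(100)))
    pool.close()
    pool.join()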

How to do multiprocessing with class instances

The program is designed to set up a process-creation listener on various IPs on the network. The code is:
import multiprocessing
from wmi import WMI

dynaIP = ['192.168.165.1', '192.168.165.2', '192.168.165.3', '192.168.165.4']

class WindowsMachine:
    def __init__(self, ip):
        self.ip = ip
        self.connection = WMI(self.ip)
        self.created_process = multiprocessing.Process(target=self.monitor_created_process, args=(self.connection,))
        self.created_process.start()

    def monitor_created_process(self, remote_pc):
        while True:
            created_process = remote_pc.Win32_Process.watch_for("creation")
            print('Creation:', created_process.Caption, created_process.ProcessId, created_process.CreationDate)
        return created_process

if __name__ == '__main__':
    for ip in dynaIP:
        print('Running', ip)
        WindowsMachine(ip)
When running the code I get the following error:
Traceback (most recent call last):
File "U:/rmarshall/Work For Staff/ROB/_Python/__Python Projects Code/multipro_instance_stack_question.py", line 26, in <module>
WindowsMachine(ip)
File "U:/rmarshall/Work For Staff/ROB/_Python/__Python Projects Code/multipro_instance_stack_question.py", line 14, in __init__
self.created_process.start()
File "C:\Python33\lib\multiprocessing\process.py", line 111, in start
self._popen = Popen(self)
File "C:\Python33\lib\multiprocessing\forking.py", line 248, in __init__
dump(process_obj, to_child, HIGHEST_PROTOCOL)
File "C:\Python33\lib\multiprocessing\forking.py", line 166, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'PyIID'>: attribute lookup builtins.PyIID failed
I have looked at other questions surrounding this issue, but I feel none of them clearly explain the workaround for pickling class instances.
Is anyone able to demonstrate this?
The problem here is that multiprocessing pickles the arguments to the process in order to pass them around. The WMI class is not pickleable, so it cannot be passed as an argument to multiprocessing.Process.
If you want this to work, you can either:
switch to using threads instead of processes (see the threading module)
create the WMI object in monitor_created_process (a sketch of this follows below)
I'd recommend the former, because there doesn't seem to be much use in creating full-blown processes.
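A sketch of the second option (my own rework of the question's code, untested since it needs a Windows host with the wmi package installed): pass only the plain IP string to the child process and create the WMI connection there.
import multiprocessing
from wmi import WMI

dynaIP = ['192.168.165.1', '192.168.165.2']  # trimmed from the question's list

def monitor_created_process(ip):
    # The WMI connection is created inside the child process, so only the
    # picklable string 'ip' ever crosses the process boundary.
    connection = WMI(ip)
    while True:
        created = connection.Win32_Process.watch_for("creation")
        print('Creation:', created.Caption, created.ProcessId, created.CreationDate)

class WindowsMachine:
    def __init__(self, ip):
        self.ip = ip
        self.created_process = multiprocessing.Process(
            target=monitor_created_process, args=(self.ip,))
        self.created_process.start()

if __name__ == '__main__':
    for ip in dynaIP:
        print('Running', ip)
        WindowsMachine(ip)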
