"Dictionary size changed during iteration" from Pebble ProcessPool - python

We have some parallel processing code built around Pebble. It has been working robustly for quite some time, but we seem to have run into an odd edge case.
Based on the exception trace (and the rock-simple code feeding it), I suspect that it's actually a bug in Pebble, but who knows.
The code feeding the process pool is pretty trivial:
from pebble import ProcessPool

pool = ProcessPool(max_workers=10, max_tasks=10)
for path in filepaths:
    try:
        # args must be a tuple, hence the trailing comma
        future = pool.schedule(function=self.analyse_file, args=(path,), timeout=30)
        future.add_done_callback(self.process_result)
    except Exception as e:
        print("Exception fired: " + str(e))  # NOT where the exception is firing
pool.close()
pool.join()
So in essence, we schedule a bunch of stuff to run, close out the pool, then wait for the pool to complete the scheduled tasks. NOTE: the exception is not thrown in the schedule loop; it fires AFTER we call join().
This is the exception stack trace:
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 150, in task_scheduler_loop
pool_manager.schedule(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 198, in schedule
self.worker_manager.dispatch(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 327, in dispatch
self.pool_channel.send(WorkerTask(task.id, task.payload))
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/channel.py", line 66, in send
return self.writer.send(obj)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
RuntimeError: dictionary changed size during iteration
I think it's got to be some weird race condition, as the code will work flawlessly on some data sets but fail at what appears to be a random point on another dataset.
We were using pebble 4.3.1 when we first ran into the issue (same version we'd had since the beginning), tried upgrading to 4.5.0, no change.
Has anybody run into similar issues with Pebble in the past? If so what was your fix?
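One possible explanation, offered as a guess rather than a diagnosis: pool.schedule(function=self.analyse_file, ...) passes a bound method, and pickling a bound method also pickles the instance it is bound to. The traceback shows the pickling running in Pebble's scheduler thread, so if self holds a dict that process_result (or anything else) mutates at the same time, the pickler could see the dict change size. The sketch below uses a hypothetical Analyser class with a hypothetical results dict to show both halves: the instance travelling with the method, and the error message produced by growing a dict mid-iteration.

```python
import pickle


class Analyser:
    """Hypothetical stand-in for the class that owns analyse_file."""

    def __init__(self):
        self.results = {}  # mutable state, e.g. filled in by a done-callback

    def analyse_file(self, path):
        return len(path)


analyser = Analyser()
analyser.results["a.txt"] = 5

# Pickling a bound method also pickles the instance it is bound to.
payload = pickle.dumps(analyser.analyse_file)
restored = pickle.loads(payload)
carried_state = restored.__self__.results  # a copy of analyser.results


def mutate_while_iterating(d):
    """Reproduce the error message by growing a dict during iteration."""
    try:
        for key in d:
            d[key + "_copy"] = d[key]
    except RuntimeError as exc:
        return str(exc)
    return None


message = mutate_while_iterating(dict(analyser.results))
```

If this is the cause, scheduling a module-level function and passing only the data it needs (for example the path) would keep the mutable instance out of the pickled payload.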

Related

Dask Distributed: Getting some errors after computations

I am running Dask Distributed on Linux CentOS 7, with a Python 3.6.2 installation. My computation seems to be going fine (I am still improving my code, but I am able to get some results), but I keep getting some Python errors apparently linked to the tornado module. I am only launching a single-node standalone Dask distributed cluster.
Here is the most common example:
Exception in thread Client loop:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/site-packages/tornado/ioloop.py", line 832, in start
self._run_callback(self._callbacks.popleft())
AttributeError: 'NoneType' object has no attribute 'popleft'
And here is another one:
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f752ce6d6a0>>
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
return self.callback()
File "/usr/local/lib/python3.6/site-packages/distributed/stealing.py", line 248, in balance
sat = s.rprocessing[key]
KeyError: 'read-block-9024000000-e3fefd2110094168cc0505db69b326e0'
Do you have any idea why? Should I close some connections or stop the standalone cluster?
Yes, if you don't close down the Tornado IOLoop before exiting the process then it can die in an unpleasant way. Fortunately this shouldn't affect your application, except by looking unpleasant.
You might submit a bug report about this, it's still something that we should fix.

Cement framework receive signal 15 on pool worker close

I'm experiencing a problem with the Cement framework for Python (using Python 3 at the moment). I have a multiprocess application which uses Python's Pool workers. At the end of every multiprocessing section (it does not interfere with the results), my stdout is filled with one or more of these exceptions:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/util.py", line 254, in _run_finalizers
finalizer()
File "/usr/lib/python3.5/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.5/multiprocessing/queues.py", line 198, in _finalize_join
thread.join()
File "/usr/lib/python3.5/threading.py", line 1054, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.5/threading.py", line 1070, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
File "/home/yogaub/.virtualenvs/seminar/lib/python3.5/site-packages/cement/core/foundation.py", line 123, in cement_signal_handler
raise exc.CaughtSignal(signum, frame)
cement.core.exc.CaughtSignal: Caught signal 15
Does anyone know why this happens, and how to prevent it?
Thanks
edit: I should add that i'm logging with the multiprocess logging system of this question. I don't really know if there is any correlation.
edit2: This is the process pool creation and termination:
pool = Pool(processes=core_num)
pool.map(worker_unpacker.work, formatted_input)
pool.close()
t2 = time.time()
I've tried catching sigterm with Cement's hook system but it doesn't work. The only solution I found at the moment is to actually completely ignore signals in the cement app configuration (but it is not really a solution I like..).
This is an educated guess: The parent process kills (terminate()s) the started processes on exit. If you call pool.join() in the parent process, then the parent process waits until all sub processes are finished and will not send SIGTERM to them.
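A minimal sketch of that suggestion, assuming a hypothetical worker function square in place of worker_unpacker.work: close() stops new submissions and join() blocks until the workers have exited on their own, so no live workers remain for the parent to terminate at shutdown.

```python
from multiprocessing import Pool


def square(x):
    """Hypothetical stand-in for worker_unpacker.work."""
    return x * x


def run_pool(inputs, processes=2):
    """Map square over inputs, then shut the pool down cleanly."""
    pool = Pool(processes=processes)
    try:
        results = pool.map(square, inputs)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the workers to exit before continuing
    return results


if __name__ == "__main__":
    print(run_pool([1, 2, 3]))
```

Whether this removes the CaughtSignal traceback in a Cement application is untested here; it only shows the close-then-join ordering the answer describes.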

Convert a multi-threaded Python to a multi-process one using concurrent futures

I have the following working code (Python 3.5) which uses concurrent futures to parse files in a threaded manner, and then do some post-processing on the results when they come back (in any order).
from concurrent import futures

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # A dictionary mapping each future (key) to the filename it was submitted for (value)
    jobs = {}
    # Loop through the files, and run the parse function for each file, sending the file-name to it, along with the kwargs of parser_variables.
    # The results of the functions can come back in any order.
    for this_file in files_list:
        job = executor.submit(parse_log_file.parse, this_file, **parser_variables)
        jobs[job] = this_file
    # Get the completed jobs whenever they are done
    for job in futures.as_completed(jobs):
        debug.checkpointer("Multi-threaded Parsing File finishing")
        # Get the job's result (job.result()) and the file the job is based on (jobs[job])
        result_content = job.result()
        this_file = jobs[job]
I want to convert this to use processes instead of threads because threads don't offer any speedup. In theory I just need to change ThreadPoolExecutor into ProcessPoolExecutor.
The problem is, if I do that I get this exception:
Process Process-2:
Traceback (most recent call last):
File "C:\Python35\lib\multiprocessing\process.py", line 254, in _bootstrap
self.run()
File "C:\Python35\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\Python35\lib\concurrent\futures\process.py", line 169, in _process_worker
call_item = call_queue.get(block=True)
File "C:\Python35\lib\multiprocessing\queues.py", line 113, in get
return ForkingPickler.loads(res)
TypeError: Required argument 'fileno' (pos 1) not found
Traceback (most recent call last):
File "c:/myscript/main.py", line 89, in <module>
main()
File "c:/myscript/main.py", line 59, in main
system_counters = process_system(system, filename)
File "c:\myscript\per_system.py", line 208, in process_system
system_counters = process_filelist(**file_handling_variables)
File "c:\myscript\per_logfile.py", line 31, in process_filelist
results_list = job.result()
File "C:\Python35\lib\concurrent\futures\_base.py", line 398, in result
return self.__get_result()
File "C:\Python35\lib\concurrent\futures\_base.py", line 357, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I think that this might have something to do with pickling, but googling for the error hasn't found anything.
How do I convert the above to use multiple processes?
It turns out this is because one of the things I'm passing inside parser_variables is a class (a reader from a third-party module). If I remove the class, the above works fine.
For whatever reason, pickle doesn't seem to be able to handle this particular object.
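To find which argument is the culprit, one option is to round-trip each value through pickle before handing it to the executor. The helper below is a hypothetical sketch (the name find_unpicklable is not from any library). It checks loads as well as dumps, since the traceback above fails on the receiving side. A threading.Lock stands in for the third-party reader, which is an assumption for illustration only.

```python
import pickle
import threading


def find_unpicklable(kwargs):
    """Return the names of values that do not survive a pickle round trip."""
    bad = []
    for name, value in kwargs.items():
        try:
            pickle.loads(pickle.dumps(value))
        except Exception:  # pickling failures surface as several exception types
            bad.append(name)
    return bad


# Hypothetical contents; the real parser_variables are not shown in the question.
parser_variables = {
    "encoding": "utf-8",
    "max_lines": 1000,
    "reader": threading.Lock(),  # stand-in for an object pickle cannot handle
}
problem_keys = find_unpicklable(parser_variables)
```

Any name reported this way would need to be replaced by picklable data, for example a path or configuration from which the worker process can build the object itself.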

Error in Python multiprocessing process

I am trying to write Python code with multiple processes, whose structure and flow are something like this:
import multiprocessing
import ctypes
import time
import errno
m=multiprocessing.Manager()
mylist=m.list()
var1=m.Value('i',0)
var2=m.Value('i',1)
var3=m.Value('i',2)
var4=m.Value(ctypes.c_char_p,"a")
var5=m.Value(ctypes.c_char_p,"b")
var6=3
var7=4
var8=5
var9=6
var10=7
def func(var1,var2,var4,var5,mylist):
    i=0
    try:
        if var1.value==0:
            print var2.value,var4.value,var5.value
            mylist.append(time.time())
        elif var1.value==1:
            i=i+2
            print var2.value+2,var4.value,var5.value
            mylist.append(time.time())
    except IOError as e:
        if e.errno==errno.EPIPE:
            var3.value=var3.value+1
            print "Error"
def work():
    for i in range(var3.value):
        print i,var6,var7,var8,var9,var10
        p=multiprocessing.Process(target=func,args=(var1,var2,var4,var5,mylist))
        p.start()
work()
When I run this code, sometimes it works perfectly, sometimes it does not run for the exact number of loop iterations, and sometimes I get the following error:
0
1
Process Process-2:
Traceback (most recent call last):
File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "dummy.py", line 19, in func
if var1.value==0:
File "/usr/lib64/python2.6/multiprocessing/managers.py", line 1005, in get
return self._callmethod('get')
File "/usr/lib64/python2.6/multiprocessing/managers.py", line 722, in _callmethod
self._connect()
File "/usr/lib64/python2.6/multiprocessing/managers.py", line 709, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib64/python2.6/multiprocessing/connection.py", line 149, in Client
answer_challenge(c, authkey)
File "/usr/lib64/python2.6/multiprocessing/connection.py", line 383, in answer_challenge
message = connection.recv_bytes(256) # reject large message
EOFError
What does this error mean, and what am I doing wrong here? Kindly guide me to the correct path. I am using CentOS 6.5.
Working with shared variables in multiprocessing is tricky. Because of Python's Global Interpreter Lock (GIL), threads cannot execute Python code in parallel. The multiprocessing module lets you run several tasks in separate processes, BUT those processes do not share memory by default.
In your case you need shared state, so you try to use shared memory. But what happens here is that you have several processes trying to read the same memory at the same time. To avoid memory corruption, a process locks the memory address it is currently reading, forbidding other processes from accessing it until it finishes reading.
Here you have several processes trying to evaluate var1.value in the first if statement of your func: the first process reads the value, and the others are blocked, raising an error.
To avoid this, you should always manage the lock of your shared variables yourself.
You can try this syntax:
import multiprocessing

var1 = multiprocessing.Value('i', 0)  # create shared variable
var1.acquire()                        # get the lock: waits until the lock is available
var1.value                            # read the value
var1.release()                        # release the lock
External documentation:
Locks: https://docs.python.org/2/library/multiprocessing.html#synchronization-between-processes
GIL: https://docs.python.org/2/glossary.html#term-global-interpreter-lock
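In Python 3 the same acquire/release pattern is usually written as a with block around get_lock(). A minimal single-process sketch, using the hypothetical names counter and increment:

```python
import multiprocessing

# Shared integer; multiprocessing.Value wraps it with a lock by default.
counter = multiprocessing.Value("i", 0)


def increment(shared, times):
    """Add 1 to the shared value `times` times, holding its lock each time."""
    for _ in range(times):
        with shared.get_lock():  # acquired here, released when the block exits
            shared.value += 1


increment(counter, 5)
final_value = counter.value
```

The lock matters most for read-modify-write updates such as +=, which are not atomic when several processes touch the same value.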

Finding exception in python multiprocessing

I have a bit of python code that looks like this:
procs = cpu_count()-1
if serial or procs == 1:
    results = map(do_experiment, experiments)
else:
    pool = Pool(processes=procs)
    results = pool.map(do_experiment, experiments)
It runs fine when I set the serial flag, but it gives the following error when the Pool is used. When I try to print something from do_experiment nothing shows up, so I can't try/catch there and print a stack trace.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 530, in __bootstrap_inner
self.run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 483, in run
self.__target(*self.__args, **self.__kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 285, in _handle_tasks
put(task)
TypeError: 'NoneType' object is not callable
What is a good way to proceed debugging this?
I went back in my git history until I found a commit where things were still working.
I had added a class to my code that extends dict so that keys can be accessed with a . (so dict.foo instead of dict["foo"]). Multiprocessing did not take kindly to this; using an ordinary dict solved the problem.
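A common way such a dict subclass breaks pickling is a __getattr__ that never raises AttributeError, for example __getattr__ = dict.get. On some Python versions, pickle's lookups of optional special methods can then return None instead of failing cleanly. Whether that was the exact cause here is an assumption, since the original class is not shown. The sketch below is a hypothetical AttrDict that raises AttributeError for missing keys and survives a pickle round trip.

```python
import pickle


class AttrDict(dict):
    """Hypothetical dict subclass allowing d.foo as well as d["foo"]."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            # AttributeError lets pickle's probes for optional hooks fail cleanly.
            raise AttributeError(name) from None


original = AttrDict(foo=1, bar="two")
restored = pickle.loads(pickle.dumps(original))
```

Because pickle is what multiprocessing uses to send tasks to workers, an object that round-trips like this should also be safe to pass to pool.map.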
