ThreadPoolExecutor hangs when calling future result in a future callback - python

I am using the "requests-futures" package and calling asynchronous get/post from within an asynchronous get/post result callback (add_done_callback on the future). Sometimes my code hangs. After many hours of investigation, I can reproduce the deadlock with this minimal code:
from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor(max_workers=10)

def f(_):
    time.sleep(0.1)  # Try to force a context switch
    x = pool.submit(lambda: None)
    print "1"
    x.result()
    print "2"

def main():
    x = pool.submit(lambda: None)
    x.add_done_callback(f)
    print "3"
    x.result()
    print "4"

print "==="
main()
If I run this piece of code in a bash loop:
$> while true; do python code.py; done;
The program always ends up hanging, with the following output:
(...)
===
1
2
3
4
===
3
4
1
If I interrupt it with Ctrl+C, I get the following stack trace:
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/home/yienyien/Angus/test/futur/env/local/lib/python2.7/site-
packages/concurrent/futures/thread.py", line 46, in _python_exit
t.join(sys.maxint)
File "/usr/lib/python2.7/threading.py", line 951, in join
self.__block.wait(delay)
File "/usr/lib/python2.7/threading.py", line 359, in wait
_sleep(delay)
KeyboardInterrupt
Error in sys.exitfunc:
Traceback (most recent call last):
File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/home/yienyien/Angus/test/futur/env/local/lib/python2.7/site-
packages/concurrent/futures/thread.py", line 46, in _python_exit
t.join(sys.maxint)
File "/usr/lib/python2.7/threading.py", line 951, in join
self.__block.wait(delay)
File "/usr/lib/python2.7/threading.py", line 359, in wait
_sleep(delay)
KeyboardInterrupt
Could somebody explain what is happening? I checked the possible deadlocks in the concurrent.futures module, but I do not think they match this case.
Thank you.

Tasks submitted to a fixed-size thread pool should not call blocking operations such as Future.result(): if every worker ends up waiting for work that can only run on another worker, no thread is left to make progress. This is a specific kind of deadlock called "thread starvation". Using time.sleep() also takes a worker out of service for a while and increases the probability of thread starvation.
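The effect is easiest to see with a single worker, where the starvation is deterministic. A minimal sketch of my own (not the original reproduction):
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)

def outer():
    # The only worker thread is busy running outer(), so the inner task
    # can never be picked up and result() blocks forever.
    inner = pool.submit(lambda: "inner")
    return inner.result()

# pool.submit(outer).result()  # uncommenting this deadlocks deterministically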

I answer my own question.
After investigation, the cause is simple: I neither shut down the ThreadPoolExecutor nor use it as a context manager (with). So sometimes main() completes and the interpreter starts finalizing the main thread; the ThreadPoolExecutor's state becomes "shutdown" while the callback has not completed yet, and the work the callback is waiting on never runs.
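As an illustration of that direction (my own sketch, not the original fix): keep the executor alive until the callback, and everything it submitted, has finished, then shut it down explicitly (or use it in a with block).
from concurrent.futures import ThreadPoolExecutor
import threading

pool = ThreadPoolExecutor(max_workers=10)
callback_done = threading.Event()

def f(_):
    x = pool.submit(lambda: None)
    x.result()
    callback_done.set()          # the callback has fully finished

def main():
    x = pool.submit(lambda: None)
    x.add_done_callback(f)
    x.result()
    callback_done.wait()         # don't let the interpreter start exiting mid-callback

main()
pool.shutdown(wait=True)         # same effect as using the executor in a "with" block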

Related

Multiprocessing Process.join() hangs

I have a worker process that goes like this:
class worker(Process):
    def __init__(self):
        # init stuff

    def run(self):
        # do stuff
        logging.info("done")  # to confirm that the process is done running
And I start 3 processes like this:
processes = 3
aproc = [None for _ in range(processes)]
bproc = [None for _ in range(processes)]
for i in range(processes):
    aproc[i] = worker(foo, bar)
    bproc[i] = worker2(foo, bar)  # different worker class
    aproc[i].start()
    bproc[i].start()
However, at the end of my code, I .join each of the processes, but they just hang and the script never ends.
for i in range(processes):
    aproc[i].join()
    bproc[i].join()
Hitting CTRL+C gives me this traceback:
Traceback (most recent call last):
File "[REDACTED]", line 571, in <module>
sproc[0].join()
File "/usr/lib/python3.9/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 43, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
I've heard of the typical deadlock, but this shouldn't be the case since all the processes print the logging statement that they are done running. Why is .join() still waiting on them? Any ideas? Thank you!
Edit: Unfortunately I can't get a minimal example working to share. Also, they do communicate with each other through multiprocessing.Queue()s, if that is relevant.
Edit 2:
Traceback of another test:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/usr/lib/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.9/multiprocessing/queues.py", line 201, in _finalize_join
thread.join()
File "/usr/lib/python3.9/threading.py", line 1033, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.9/threading.py", line 1049, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
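No answer is recorded here, but the second traceback is stuck joining a Queue's feeder thread, and the multiprocessing documentation warns against join()ing a process that still has unflushed items on a Queue. A hedged sketch of that documented pattern, with illustrative names rather than the asker's code: drain the queue before joining the producers.
import multiprocessing as mp

def producer(q):
    for i in range(1000):
        q.put(i)   # buffered items are flushed to the pipe by a background feeder thread

if __name__ == "__main__":
    q = mp.Queue()
    procs = [mp.Process(target=producer, args=(q,)) for _ in range(3)]
    for p in procs:
        p.start()

    results = [q.get() for _ in range(3 * 1000)]  # consume everything first...
    for p in procs:
        p.join()                                  # ...so join() cannot hang on the feeder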

"Dictionary size changed during iteration" from Pebble ProcessPool

We have some parallel processing code built around Pebble. It has been working robustly for quite some time, but we seem to have run into an odd edge case.
Based on the exception trace (and the rock-simple code feeding it) I suspect it's actually a bug in Pebble, but who knows.
The code feeding the process pool is pretty trivial:
pool = ProcessPool(max_workers=10, max_tasks=10)
for path in filepaths:
    try:
        future = pool.schedule(function=self.analyse_file, args=[path], timeout=30)
        future.add_done_callback(self.process_result)
    except Exception as e:
        print("Exception fired: " + str(e))  # NOT where the exception is firing
pool.close()
pool.join()
So in essence, we schedule a bunch of stuff to run, close out the pool, then wait for the pool to complete the scheduled tasks. NOTE: the exception is not thrown in the schedule loop; it gets fired AFTER we call join().
This is the exception stack trace:
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 150, in task_scheduler_loop
pool_manager.schedule(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 198, in schedule
self.worker_manager.dispatch(task)
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/process.py", line 327, in dispatch
self.pool_channel.send(WorkerTask(task.id, task.payload))
File "/home/user/.pyenv/versions/scrapeapp/lib/python3.6/site-packages/pebble/pool/channel.py", line 66, in send
return self.writer.send(obj)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
RuntimeError: dictionary changed size during iteration
I think it's got to be some weird race condition, as the code will work flawlessly on some data sets but fail at what appears to be a random point on another dataset.
We were using pebble 4.3.1 when we first ran into the issue (same version we'd had since the beginning), tried upgrading to 4.5.0, no change.
Has anybody run into similar issues with Pebble in the past? If so what was your fix?
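No fix is recorded in this thread. One hedged observation of my own: scheduling the bound method self.analyse_file pickles the whole instance for every task, so if the done-callback (which Pebble runs on an internal thread) mutates that instance's attributes while another task is being pickled, a "dictionary changed size during iteration" is plausible. A sketch of scheduling a module-level function with plain arguments instead (illustrative names, not the original code):
from pebble import ProcessPool

def analyse_file(path):
    # module-level function: only `path` is pickled, no shared object state
    return path, len(path)

def process_result(future):
    path, size = future.result()
    print(path, size)            # aggregate results on the callback thread only

filepaths = ["a.txt", "b.txt"]   # illustrative input
pool = ProcessPool(max_workers=10, max_tasks=10)
for path in filepaths:
    future = pool.schedule(analyse_file, args=[path], timeout=30)
    future.add_done_callback(process_result)
pool.close()
pool.join()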

Running threaded module with celery; daemonic processes are not allowed to have children

I have implemented a queue with Celery in my Flask app. Everything works well.
But I need to use a module called sublist3r, and when I use it in a Celery task I receive this error:
[2019-02-16 21:32:52,658: INFO/ForkPoolWorker-6] Task tasks.task.addd[57793628-de25-4c89-a265-5fee69a8b2bf] succeeded in 0.0236732449848s: None
[2019-02-16 21:32:52,660: WARNING/ForkPoolWorker-6] Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/me/code/proj/tasks/task.py", line 15, in getd
sub = sublist3r.main(url, 40, None, ports=None, silent=True,verbose=False, enable_bruteforce=False, engines=None)
File "/home/me/code/proj/sublist3r/sublist3r.py", line 871, in main
subdomains_queue = multiprocessing.Manager().list()
File "/usr/lib/python2.7/multiprocessing/__init__.py", line 99, in Manager
m.start()
File "/usr/lib/python2.7/multiprocessing/managers.py", line 524, in start
self._process.start()
File "/usr/lib/python2.7/multiprocessing/process.py", line 124, in start
'daemonic processes are not allowed to have children'
**AssertionError: daemonic processes are not allowed to have children**
Does this happen because I'm trying to use a module that uses threads?
How could I use this module either in a queue or asynchronously?
Thank you
It appears that sublist3r uses multiprocessing and tries to kick off its own processes. You can't really do that within Celery: in production, Celery already runs each worker in its own child process, and as the error message says, those daemonic worker processes are not allowed to spawn the multiprocessing processes that sublist3r uses. If you want to use it, your best bet is to rewrite those classes in sublist3r yourself to derive from celery.Task instead of multiprocessing.Process.
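The restriction itself comes from multiprocessing rather than from Celery or sublist3r: a daemonic process may not start children of its own, and Celery's prefork workers are daemonic. A minimal sketch reproducing the same error outside Celery, assuming that is indeed the mechanism:
import multiprocessing

def task():
    # Inside a daemonic process, starting another process (which
    # multiprocessing.Manager() does internally) fails with
    # "daemonic processes are not allowed to have children".
    multiprocessing.Manager()

if __name__ == "__main__":
    p = multiprocessing.Process(target=task)
    p.daemon = True
    p.start()
    p.join()   # the child dies with the same "daemonic processes" error shown above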

Cement framework receives signal 15 on pool worker close

I'm experiencing a problem with the Cement framework for Python (using Python 3 at the moment). I have a multiprocess application which uses Python's Pool workers. At the end of every multiprocessing section (it does not interfere with the results), my stdout is filled with one or more of these exceptions:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/util.py", line 254, in _run_finalizers
finalizer()
File "/usr/lib/python3.5/multiprocessing/util.py", line 186, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib/python3.5/multiprocessing/queues.py", line 198, in _finalize_join
thread.join()
File "/usr/lib/python3.5/threading.py", line 1054, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.5/threading.py", line 1070, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
File "/home/yogaub/.virtualenvs/seminar/lib/python3.5/site-packages/cement/core/foundation.py", line 123, in cement_signal_handler
raise exc.CaughtSignal(signum, frame)
cement.core.exc.CaughtSignal: Caught signal 15
Does anyone know why this happens, and how to prevent it?
Thanks
edit: I should add that I'm logging with the multiprocessing logging setup from this question. I don't really know if there is any correlation.
edit2: This is the process pool creation and termination:
pool = Pool(processes=core_num)
pool.map(worker_unpacker.work, formatted_input)
pool.close()
t2 = time.time()
I've tried catching SIGTERM with Cement's hook system, but it doesn't work. The only solution I have found so far is to completely ignore signals in the Cement app configuration (but that is not really a solution I like...).
This is an educated guess: the parent process kills (terminate()s) the started processes on exit. If you call pool.join() in the parent process, the parent waits until all subprocesses are finished and will not send SIGTERM to them.
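Concretely, that means adding a pool.join() after pool.close() in the snippet above. A self-contained sketch, with stand-ins for core_num, worker_unpacker.work and formatted_input, which are not defined in the question:
from multiprocessing import Pool, cpu_count

def work(item):                      # stand-in for worker_unpacker.work
    return item * 2

if __name__ == "__main__":
    formatted_input = list(range(100))     # stand-in input
    pool = Pool(processes=cpu_count())     # stand-in for core_num
    results = pool.map(work, formatted_input)
    pool.close()
    pool.join()  # wait for the workers to exit before the framework tears down,
                 # so no child is left around to receive SIGTERM at interpreter exit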

Finding exception in python multiprocessing

I have a bit of python code that looks like this:
procs = cpu_count()-1
if serial or procs == 1:
    results = map(do_experiment, experiments)
else:
    pool = Pool(processes=procs)
    results = pool.map(do_experiment, experiments)
It runs fine when I set the serial flag, but it gives the following error when the Pool is used. When I try to print something from do_experiment nothing shows up, so I can't try/catch there and print a stack trace.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 530, in __bootstrap_inner
self.run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 483, in run
self.__target(*self.__args, **self.__kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 285, in _handle_tasks
put(task)
TypeError: 'NoneType' object is not callable
What is a good way to proceed with debugging this?
I went back in my git history until I found a commit where things were still working.
I had added a class to my code that extends dict so that keys can be accessed with a dot (dict.foo instead of dict["foo"]). Multiprocessing did not take kindly to this; using an ordinary dict solved the problem.
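For context, a sketch of the kind of class described (my reconstruction, not the original code): everything handed to pool.map gets pickled, so a dict subclass with attribute access needs a well-behaved __getattr__ (one that raises AttributeError rather than leaking KeyError or returning None for names like __getstate__) to survive that round trip.
import pickle

class AttrDict(dict):
    """dict subclass allowing d.foo as well as d["foo"] (illustrative)."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            # Raise AttributeError so pickle's getattr() probing for names
            # such as __getstate__ behaves normally.
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

d = AttrDict(foo=1)
d.bar = 2
print(pickle.loads(pickle.dumps(d)))   # multiprocessing pickles task data the same way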
