Python Child Processes Not Always Cleaning Up

I'm working on a small piece of Python (2.7.3): a script that constantly monitors a message/queue broker and processes the entries it finds. Due to the volume involved, the processing is handled with multiprocessing, in a fashion that looks a bit like this:
from multiprocessing import Process, Queue, JoinableQueue

result = Queue()
q = JoinableQueue()
process_pool = []
for i in range(args.max_processes):
    q.put(i)
    p = Process(target=worker, args=(q, result, broker, ...))
    process_pool.append(p)
# Start processes
for p in process_pool:
    p.start()
# Block until completion
q.join()
logger.warning("All processes completed")
Despite the code regularly iterating and logging that all processes had completed, I found PIDs gradually stacking up beyond args.max_processes.
I added an additional block to the end of this:
for p in process_pool:
    if p.is_alive():
        logger.warning("Process with pid %s is still alive - terminating" % p.pid)
        try:
            p.terminate()
        except Exception as e:
            logger.warning("PROBLEM KILLING PID: stack: %s" % e)
I reaped all processes to clear the slate and started again. I can clearly see the logger intermittently report instances where a PID is still alive even after it has flagged completion to the parent process, AND the terminate() call fails to kill it.
I added logger output to the individual threads and each one logs success indicating that it's completed cleanly prior to signaling completion to the parent process, and yet it still hangs around.
Because I plan to run this as a service, over time the number of stray processes lying around can cause problems as they stack up into the thousands.
I'd love some insight on what I'm missing and doing wrong.
Thank you
Edit: Update - adding overview of worker block for question completeness:
The worker interacts with a message/queue broker; I'll mostly omit the details, since apart from some debug logging everything is wrapped in a try/except block and each worker appears to run to completion, even on the occasions when a child process gets left behind.
def worker(queue, result_queue, broker, other_variables...):
    logger.warning("Launched individual thread")
    job = queue.get()
    try:
        pass  # message broker logic goes here
    except Exception as e:
        result_queue.put("Caught exception: %s" % e.message)
    logger.warning("Individual thread completed cleanly...")
    queue.task_done()
To reiterate the problem: without any exceptions being thrown and caught, I can see all the logging indicating that n workers are started, run to completion, and report good status on each iteration. The blocking q.join() call, which cannot return until all workers complete their tasks, returns each time, yet some very small number of processes get left behind. I can see them with ps -ef, and if I monitor their count over time it gradually increases until it breaks Python's multiprocessing capabilities. I added code to look for these instances and manually terminate them; it works insofar as it detects the hung processes, but it does not seem able to terminate them, despite the processes having reported clean completion. What am I missing?
Thanks again!
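A minimal sketch of one cleanup pattern worth trying against the setup above (q, result and process_pool are the names from the question; the drain step is an assumption): a child that has put data on a multiprocessing.Queue will not exit until its feeder thread has flushed that data to the underlying pipe, so draining the result queue and then join()ing each Process explicitly may reap the stragglers that q.join() alone leaves behind.

# Block until every queued task has been marked done
q.join()

# Drain the result queue so no child is left blocked flushing its buffer
# (assumes the parent does not otherwise consume from result)
while not result.empty():
    logger.info("Worker result: %s" % result.get())

# Explicitly reap each child; join() with a timeout so a truly wedged
# process can still be terminated afterwards
for p in process_pool:
    p.join(timeout=5)
    if p.is_alive():
        logger.warning("Process %s still alive after join - terminating" % p.pid)
        p.terminate()
        p.join()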

Related

Handling worker death in multiprocessing Pool

I have a simple server:
from multiprocessing import Pool, TimeoutError
import time
import os

if __name__ == '__main__':
    # start worker processes
    pool = Pool(processes=1)
    while True:
        # evaluate "os.getpid()" asynchronously
        res = pool.apply_async(os.getpid, ())  # runs in *only* one process
        try:
            print(res.get(timeout=1))  # prints the PID of that process
        except TimeoutError:
            print('worker timed out')
        time.sleep(5)
    pool.close()
    print("Now the pool is closed and no longer available")
    pool.join()
    print("Done")
If I run this I get something like:
47292
47292
Then I kill 47292 while the server is running. A new worker process is started but the output of the server is:
47292
47292
worker timed out
worker timed out
worker timed out
The pool is still trying to send requests to the old worker process.
I've done some work with catching signals in both the server and the workers, and I can get slightly better behaviour, but the server still seems to wait for dead children on shutdown (i.e. pool.join() never returns) after a worker is killed.
What is the proper way to handle workers dying?
Graceful shutdown of workers from a server process only seems to work if none of the workers has died.
(On Python 3.4.4 but happy to upgrade if that would help.)
UPDATE:
Interestingly, this worker timeout problem does NOT happen if the pool is created with processes=2 and you kill one worker process, wait a few seconds and kill the other one. However, if you kill both worker processes in rapid succession then the "worker timed out" problem manifests itself again.
Perhaps related is that when the problem occurs, killing the server process will leave the worker processes running.
This behavior comes from the design of multiprocessing.Pool. When you kill a worker, you might kill the one holding the call_queue.rlock. When that process dies while holding the lock, no other process will ever be able to read from the call_queue again, which breaks the Pool, as it can no longer communicate with its workers.
So there is actually no way to kill a worker and be sure that your Pool will still be okay afterwards, because you might end up in a deadlock.
multiprocessing.Pool does not handle a worker dying. You can try using concurrent.futures.ProcessPoolExecutor instead (with a slightly different API), which handles the failure of a process by default. When a process dies in ProcessPoolExecutor, the entire executor is shut down and you get back a BrokenProcessPool error.
Note that there are other deadlocks in this implementation that should be fixed in loky. (DISCLAIMER: I am a maintainer of this library.) Also, loky lets you resize an existing executor using a ReusablePoolExecutor and the method _resize. Let me know if you are interested; I can provide some help getting started with this package. (I realize we still need a bit of work on the documentation... 0_0)
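For illustration, a minimal sketch of the ProcessPoolExecutor behaviour described above, reusing the one-second timeout and the os.getpid task from the question (the recreate-on-failure step is an assumption, not part of the original answer):

import os
import time
from concurrent.futures import ProcessPoolExecutor, TimeoutError
from concurrent.futures.process import BrokenProcessPool

if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=1)
    while True:
        try:
            future = executor.submit(os.getpid)
            print(future.result(timeout=1))  # prints the worker's PID
        except TimeoutError:
            print('worker timed out')
        except BrokenProcessPool:
            # a worker died: the whole executor is now unusable,
            # so recreate it instead of submitting to the broken one
            print('executor broken - recreating')
            executor = ProcessPoolExecutor(max_workers=1)
        time.sleep(5)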

Multiprocessing pool.join() hangs under some circumstances

I am trying to create a simple producer / consumer pattern in Python using multiprocessing. It works, but it hangs on pool.join().
from multiprocessing import Pool, Queue

que = Queue()

def consume():
    while True:
        element = que.get()
        if element is None:
            print('break')
            break
    print('Consumer closing')

def produce(nr):
    que.put([nr] * 1000000)
    print('Producer {} closing'.format(nr))

def main():
    p = Pool(5)
    p.apply_async(consume)
    p.map(produce, range(5))
    que.put(None)
    print('None')
    p.close()
    p.join()

if __name__ == '__main__':
    main()
Sample output:
~/Python/Examples $ ./multip_prod_cons.py
Producer 1 closing
Producer 3 closing
Producer 0 closing
Producer 2 closing
Producer 4 closing
None
break
Consumer closing
However, it works perfectly when I change one line:
que.put([nr] * 100)
It is 100% reproducible on a Linux system running Python 3.4.3 or Python 2.7.10. Am I missing something?
There is quite a lot of confusion here. What you are writing is not a producer/consumer scenario but a mess that misuses another pattern, usually referred to as a "pool of workers".
The pool of workers pattern is an application of the producer/consumer one in which there is one producer that schedules the work and many consumers that consume it. In this pattern, the owner of the Pool ends up being the producer while the workers are the consumers.
In your example you instead have a hybrid solution where one worker ends up being a consumer and the others act as a sort of middleware. The whole design is very inefficient, duplicates most of the logic already provided by the Pool and, more importantly, is very error prone. What you end up suffering from is a deadlock.
Putting an object into a multiprocessing.Queue is an asynchronous operation. It blocks only if the Queue is full, and your Queue has infinite size, so it never blocks.
This means your produce function returns immediately, so the call to p.map is not blocking as you expect it to. The related worker processes instead wait until the actual message goes through the Pipe which the Queue uses as its communication channel.
What happens next is that you terminate your consumer prematurely: the None "message" you put in the Queue gets delivered before all the lists your produce function creates have been pushed through the Pipe.
You notice the issue once you call p.join, but the real situation is the following:
- the p.join call is waiting for all the worker processes to terminate;
- the worker processes are waiting for the big lists to go through the Queue's Pipe;
- since the consumer worker is long gone, nobody drains the Pipe, which is obviously full.
The issue does not show up if your lists are small enough to go through the Pipe before you send the termination message to the consume function.
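For illustration, a minimal sketch of the same workload rearranged into the pool of workers pattern described above: the Pool owner schedules the work and consumes the results through the Pool's own result channel instead of a hand-rolled Queue (the list size comes from the question; the rest is an assumption):

from multiprocessing import Pool

def produce(nr):
    # build the payload in the worker and hand it back through the Pool
    return [nr] * 1000000

def main():
    p = Pool(5)
    try:
        # the parent process is the single consumer of the results
        for payload in p.map(produce, range(5)):
            print('Consumed a list of {} elements'.format(len(payload)))
    finally:
        p.close()
        p.join()

if __name__ == '__main__':
    main()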

Python - Notifying another thread blocked on subprocess

I am creating a custom job scheduler with a web frontend in Python 3.4 on Linux. This program creates a daemon (consumer) thread that waits for jobs to become available in a PriorityQueue. These jobs can be added manually through the web interface, which puts them on the queue. When the consumer thread finds a job, it executes a program using subprocess.run and waits for it to finish.
The basic idea of the worker thread:
import subprocess
import threading

class Worker(threading.Thread):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue
        # more code here
    def run(self):
        while True:
            try:
                job = self.queue.get()
                # do some work
                proc = subprocess.run("myprogram", timeout=my_timeout)
                # do some more things
            except subprocess.TimeoutExpired:
                # do some administration
                self.queue.put(job)
However:
This consumer should be able to receive some kind of signal from the frontend (main thread) that it should stop the current job and instead work on the next job in the queue (saving the state of the current job and adding it to the end of the queue again). This can (and will most likely) happen while blocked on subprocess.run().
The subprocesses can simply be killed (the program that is executed saves some state in a file), but the worker thread needs to do some administration on the killed job to make sure it can be resumed later on.
There can be multiple such worker threads.
Signal handlers are not an option (since they are always handled by the main thread which is a webserver and should not be bothered with this).
Having an event loop in which the process actively polls for events (such as the child exiting, the timeout occurring or the interrupt event) is in this context not really a solution but an ugly hack. The jobs are performance-heavy and constant context switches are unwanted.
What synchronization primitives should I use to interrupt this thread or to make sure it waits for several events at the same time in a blocking fashion?
I think you've accidentally glossed over a simple solution: your second bullet point says that you have the ability to kill the programs that are running in subprocesses. Notice that subprocess.call returns the return code of the subprocess. This means that you can let the main thread kill the subprocess and just check the return code to see if you need to do any cleanup. Even better, you could use subprocess.check_call instead, which will raise an exception for you if the return code isn't 0. I don't know what platform you're working on, but on Linux, killed processes generally don't return 0.
It could look something like this:
import subprocess
import threading

class Worker(threading.Thread):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue
        # more code here
    def run(self):
        while True:
            try:
                job = self.queue.get()
                # do some work
                subprocess.check_call("myprogram", timeout=my_timeout)
                # do some more things
            except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
                # do some administration
                self.queue.put(job)
Note that if you're using Python 3.5, you can use subprocess.run instead, and set the check argument to True.
If you have a strong need to handle the cases where the worker needs to be interrupted when it isn't running the subprocess, then I think you're going to have to use a polling loop, because I don't think the behavior you're looking for is supported for threads in Python. You can use a threading.Event object to pass the "stop working now" pseudo-signal from your main thread to the worker, and have the worker periodically check the state of that event object.
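A minimal sketch of that Event-based check, assuming a hypothetical stop_event that the main thread passes in alongside the job queue (names are illustrative):

import queue
import threading

class Worker(threading.Thread):
    def __init__(self, job_queue, stop_event):
        super().__init__()
        self.job_queue = job_queue
        self.stop_event = stop_event  # hypothetical threading.Event, set by the main thread

    def run(self):
        while not self.stop_event.is_set():
            try:
                # short timeout so the stop event is re-checked periodically
                job = self.job_queue.get(timeout=1)
            except queue.Empty:
                continue
            # ... run the job, checking self.stop_event between steps ...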
If you're willing to consider using multiple processes instead of threads, consider switching over to the multiprocessing module, which would allow you to handle signals. There is more overhead to spawning full-blown subprocesses instead of threads, but you're essentially looking for signal-like asynchronous behavior, and I don't think Python's threading library supports anything like that. One benefit, though, is that you would be freed from the Global Interpreter Lock, so you may actually see some speed benefits if your worker processes (formerly threads) are doing anything CPU-intensive.

Processes sharing queue not terminating properly

I have a multiprocessing application where the parent process creates a queue and passes it to the worker processes. All processes use this queue to create a QueueHandler for logging, and there is a dedicated worker process that reads from this queue and does the actual logging.
The worker processes continuously check whether the parent is alive. The problem is that when I kill the parent process from the command line, all workers are killed except for one. The logger process also terminates. I don't know why one process keeps executing. Is it because of locks etc. in the queue? How do I exit properly in this scenario? I am using
sys.exit(0)
for exiting.
I would use sys.exit(0) only if there is no other choice. It's always better to finish each thread / process cleanly. You will have some while loop in your Process, so just break there so that it can come to an end.
Tidy up before you leave, i.e., release all handles of external resources, e.g., files, sockets, pipes.
Somewhere in these handles might be the reason for the behavior you see.
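A minimal sketch of that shape of worker loop, assuming the parent-liveness check is done by watching for the parent PID to change (on Linux an orphaned child is reparented, so os.getppid() no longer matches); the names are illustrative:

import os
import time

def worker(log_queue):
    parent_pid = os.getppid()
    while True:
        if os.getppid() != parent_pid:
            # parent is gone: leave the loop instead of calling sys.exit(0)
            break
        # ... normal work: handle messages, push records onto log_queue ...
        time.sleep(1)
    # tidy up before leaving: close files, sockets, pipes and the queue
    log_queue.close()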

Parent Thread exiting before Child Threads [python]

I'm using Python in a webapp (CGI for testing, FastCGI for production) that needs to send an occasional email (when a user registers or something else important happens). Since communicating with an SMTP server takes a long time, I'd like to spawn a thread for the mail function so that the rest of the app can finish up the request without waiting for the email to finish sending.
I tried using thread.start_new(func, (args)), but the parent returns and exits before the sending is complete, thereby killing the sending thread before it does anything useful. Is there any way to keep the process alive long enough for the child thread to finish?
Take a look at the Thread.join() method. Basically it will block your calling thread until the child thread has returned (thus preventing it from exiting before it should).
Update:
To avoid making your main thread unresponsive to new requests you can use a while loop.
while threading.active_count() > 1:  # the main thread itself always counts as active
    # ... look for new requests to handle ...
    time.sleep(0.1)
    # or try joining your threads with a timeout
    # for thread in my_threads:
    #     thread.join(0.1)
Update 2:
It also looks like thread.start_new(func, args) is obsolete; it was replaced by thread.start_new_thread(function, args[, kwargs]). You can also create threads with the higher-level threading module (this is the module that provides the active_count() used in the previous code block):
import threading
my_thread = threading.Thread(target=func, args=(), kwargs={})
my_thread.daemon = True
my_thread.start()
You might want to use threading.enumerate, if you have multiple workers and want to see which one(s) are still running.
Other alternatives include using threading.Event: the main thread sets the event and starts the worker thread off. The worker thread clears the event when it finishes work, and the main thread checks whether the event is set to figure out whether it can exit.
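A minimal sketch of that Event handshake, assuming a hypothetical send_mail_worker function (names are illustrative):

import threading
import time

def send_mail_worker(pending):
    # ... talk to the SMTP server here ...
    pending.clear()  # worker unsets the event when the work is finished

pending = threading.Event()
pending.set()  # main thread marks the work as outstanding
threading.Thread(target=send_mail_worker, args=(pending,)).start()

# main thread keeps serving until the worker has cleared the event
while pending.is_set():
    # ... handle other requests ...
    time.sleep(0.1)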
