Handling worker death in multiprocessing Pool - python

I have a simple server:
from multiprocessing import Pool, TimeoutError
import time
import os

if __name__ == '__main__':
    # start worker processes
    pool = Pool(processes=1)

    while True:
        # evaluate "os.getpid()" asynchronously
        res = pool.apply_async(os.getpid, ())  # runs in *only* one process
        try:
            print(res.get(timeout=1))  # prints the PID of that process
        except TimeoutError:
            print('worker timed out')
        time.sleep(5)

    pool.close()
    print("Now the pool is closed and no longer available")
    pool.join()
    print("Done")
If I run this I get something like:
47292
47292
Then I kill 47292 while the server is running. A new worker process is started but the output of the server is:
47292
47292
worker timed out
worker timed out
worker timed out
The pool is still trying to send requests to the old worker process.
I've done some work with catching signals in both the server and the workers, and I can get slightly better behaviour, but the server still seems to wait for dead children on shutdown (i.e. pool.join() never returns) after a worker is killed.
What is the proper way to handle workers dying?
Graceful shutdown of workers from a server process only seems to work if none of the workers has died.
(On Python 3.4.4 but happy to upgrade if that would help.)
UPDATE:
Interestingly, this worker timeout problem does NOT happen if the pool is created with processes=2 and you kill one worker process, wait a few seconds and kill the other one. However, if you kill both worker processes in rapid succession then the "worker timed out" problem manifests itself again.
Perhaps related is that when the problem occurs, killing the server process will leave the worker processes running.

This behavior comes from the design of multiprocessing.Pool. When you kill a worker, you might kill the one holding the call_queue's rlock. When that process dies while holding the lock, no other process will ever be able to read from the call_queue again, which breaks the Pool because it can no longer communicate with its workers.
So there is actually no way to kill a worker and be sure that your Pool will still be okay afterwards, because you might end up in a deadlock.
multiprocessing.Pool simply does not handle a worker dying. You can try using concurrent.futures.ProcessPoolExecutor instead (with a slightly different API), which does handle the failure of a process by default: when a process dies in a ProcessPoolExecutor, the entire executor is shut down and any pending or subsequent calls raise a BrokenProcessPool error.
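A minimal sketch of that behaviour (assuming Python 3 on a POSIX system; SIGKILL is used here only to simulate a crashed worker):

import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def get_pid():
    return os.getpid()

if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=1)
    pid = executor.submit(get_pid).result()
    print(pid)

    # Simulate a worker crash
    os.kill(pid, signal.SIGKILL)

    try:
        print(executor.submit(get_pid).result())
    except BrokenProcessPool:
        # The executor is unusable from now on; the only way to
        # recover is to create a fresh one.
        executor = ProcessPoolExecutor(max_workers=1)
        print(executor.submit(get_pid).result())

So unlike Pool, the failure is at least detected and surfaced as an exception instead of hanging forever, and you can rebuild the executor in response.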
Note that there are other deadlocks in this implementation that should be fixed in loky (disclaimer: I am a maintainer of that library). Also, loky lets you resize an existing executor using a ReusablePoolExecutor and its _resize method. Let me know if you are interested and I can help you get started with this package. (I realize the documentation still needs a bit of work... 0_0)
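If you want to try loky, here is a short sketch based on my reading of its docs (get_reusable_executor is the public entry point; the exact behaviour may differ between versions):

# pip install loky
from loky import get_reusable_executor

def work():
    return "hello from a loky worker"

# Returns a ProcessPoolExecutor-like executor; calling get_reusable_executor
# again with a different max_workers reuses and resizes the same executor.
executor = get_reusable_executor(max_workers=2)
print(executor.submit(work).result())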

Related

Understanding Gunicorn worker processes, when using threading module for sending mail

In my setup I am using Gunicorn for my deployment on a single-CPU machine, with three worker processes. I came to ask this question after reading this answer: https://stackoverflow.com/a/53327191/10268003 . I have found that it takes up to one and a half seconds to send a mail, so I was trying to send the email asynchronously. I am trying to understand what will happen to the worker process started by Gunicorn when it starts a new thread to send the mail: will the process be blocked until the mail-sending thread finishes? In that case I believe my application's throughput will decrease. I did not want to use Celery because setting it up just for sending emails seems like overkill. I am currently running two containers on the same (development) machine, with three gunicorn workers each.
Below is the approach in question; the only difference is that I will be using threading for sending mails.
import threading
from django.http import JsonResponse
from .models import Crawl

def startCrawl(request):
    task = Crawl()
    task.save()
    t = threading.Thread(target=doCrawl, args=[task.id])
    t.setDaemon(True)
    t.start()
    return JsonResponse({'id': task.id})

def checkCrawl(request, id):
    task = Crawl.objects.get(pk=id)
    return JsonResponse({'is_done': task.is_done, 'result': task.result})

def doCrawl(id):
    task = Crawl.objects.get(pk=id)
    # Do crawling, etc.
    task.result = result
    task.is_done = True
    task.save()
Assuming that you are using gunicorn's sync (default), gthread, or async workers, you can indeed spawn threads and gunicorn will take no notice of them and will not interfere. The worker threads are reused to answer subsequent requests immediately after returning a result, not only after all spawned threads have been joined.
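For illustration, a fire-and-forget mail thread inside a view could look like the sketch below (notify_async and the addresses are made-up names; Django's send_mail is assumed):

import threading
from django.core.mail import send_mail

def notify_async(subject, body, recipients):
    # The gunicorn worker returns the HTTP response immediately;
    # this daemon thread finishes the SMTP conversation in the background.
    threading.Thread(
        target=send_mail,
        args=(subject, body, 'noreply@example.com', recipients),
        daemon=True,
    ).start()

Keep in mind that a daemon thread is killed if the worker process exits or is restarted, so a mail that is still in flight at that moment is lost.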
I have used this code to fire an independent event a minute or so after a request:
Timer(timeout, function_that_does_something, [arguments_to_function]).start()
You will find some more technical details in this other answer:
In normal operations, these Workers run in a loop until the Master either tells them to graceful shutdown or kills them. Workers will periodically issue a heartbeat to the Master to indicate that they are still alive and working. If a heartbeat timeout occurs, then the Master will kill the Worker and restart it.
Therefore, daemon and non-daemon threads that do not interfere with the Worker's main loop should have no impact. If the thread does interfere with the Worker's main loop, such as a scenario where the thread is performing work and will provide results to the HTTP Response, then consider using an Async Worker. Async Workers allow for the TCP connection to remain alive for a long time while still allowing the Worker to issue heartbeats to the Master.
I have since moved on to asynchronous, event-loop-based solutions, such as the uvicorn worker for gunicorn together with the FastAPI framework, which provide alternatives to waiting for IO in threads.

Python Child Processes Not Always Cleaning Up

I'm working on a small bit of Python (2.7.3) to run a script which constantly monitors a message/queue broker and processes the entries it finds. Due to the volume, the processing is done in a multiprocessing fashion that looks a bit like this:
result = Queue()
q = JoinableQueue()
process_pool = []

for i in range(args.max_processes):
    q.put(i)
    p = Process(target=worker, args=(q, result, broker, ...))
    process_pool.append(p)

# Start processes
for p in process_pool:
    p.start()

# Block until completion
q.join()

logger.warning("All processes completed")
Despite the code regularly iterating and logging that all processes had completed, I found PIDs gradually stacking up beyond args.max_processes.
I added an additional block to the end of this:
for p in process_pool:
    if p.is_alive():
        logger.warning("Process with pid %s is still alive - terminating" % p.pid)
        try:
            p.terminate()
        except Exception as e:
            logger.warning("PROBLEM KILLING PID: stack: %s" % e)
I reaped all processes to clear the slate and started again, and I can clearly see the logger very intermittently showing instances where a PID is found to still be alive even after it has flagged completion to the parent process, AND the terminate call fails to kill it.
I added logger output to the individual threads and each one logs success indicating that it's completed cleanly prior to signaling completion to the parent process, and yet it still hangs around.
Because I plan to run this as a service, over time the number of stray processes lying around can cause problems as they stack up into the thousands.
I'd love some insight on what I'm missing and doing wrong.
Thank you
Edit: Update - adding overview of worker block for question completeness:
The worker interacts with a message/queue broker. I'll omit most of the details, since apart from some debug logging messages everything is wrapped in a try/except block, and each worker appears to run to completion even on the occasions when a child process gets left behind.
def worker(queue, result_queue, broker, other_variables...):
    logger.warning("Launched individual thread")
    job = queue.get()
    try:
        message_broker logic
    except Exception as e:
        result_queue.put("Caught exception: %s" % e.message)
    logger.warning("Individual thread completed cleanly...")
    queue.task_done()
To reiterate the problem: without any exceptions being thrown and caught, I can see all the logging indicating that n workers are started, run to completion, and complete with good status on each iteration. The blocking q.join() call, which cannot complete until all workers have finished, returns each time, but a very small number of processes get left behind. I can see them with ps -ef, and if I monitor their count over time it gradually increases until it breaks Python's multiprocessing capabilities. I added code to look for these instances and manually terminate them, which works insofar as it detects the hung processes, but it does not seem able to terminate them, despite the processes having reported good completion. What am I missing?
Thanks again!

Reduce the number of workers on a machine in Python-RQ?

What is a good way to reduce the number of workers on a machine in Python-RQ?
According to the documentation, I need to send a SIGINT or SIGTERM command to one of the worker processes on the machine:
Taking down workers
If, at any time, the worker receives SIGINT (via Ctrl+C) or SIGTERM (via kill), the worker waits until the currently running task is finished, stops the work loop and gracefully registers its own death.
If, during this takedown phase, SIGINT or SIGTERM is received again, the worker will forcefully terminate the child process (sending it SIGKILL), but will still try to register its own death.
This seems to imply a lot of coding overhead:
Would need to keep track of the PID for the worker process
Would need to have a way to send a SIGINT command from a remote machine
Do I really need to custom build this, or is there a way to do this easily using the Python-RQ library or some other existing library?
Get all running workers using rq.Worker.all()
Select the worker you want to kill
Use os.kill(worker.pid, signal.SIGINT)
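Putting those steps together, a sketch (it assumes the script runs on the same machine as the worker you want to retire, and that the workers registered themselves in the default Redis instance):

import os
import signal
from redis import Redis
from rq import Worker

redis_conn = Redis()

# 1. Get all running workers registered in this Redis instance
workers = Worker.all(connection=redis_conn)

# 2. Select the worker you want to take down (here simply the first one)
worker = workers[0]

# 3. Warm shutdown: the worker finishes its current job, then exits
os.kill(worker.pid, signal.SIGINT)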

Processes sharing queue not terminating properly

I have a multiprocessing application where the parent process creates a queue and passes it to the worker processes. All processes use this queue to create a QueueHandler for the purpose of logging. There is a worker process reading from this queue and doing the logging.
The worker processes continuously check whether the parent is alive or not. The problem is that when I kill the parent process from the command line, all workers are killed except for one. The logger process also terminates. I don't know why one process keeps executing. Is it because of some lock, etc., in the queue? How do I properly exit in this scenario? I am using
sys.exit(0)
for exiting.
I would use sys.exit(0) only if there is no other choice. It's always better to finish each thread / process cleanly. Your Process will have some while loop in it, so just break out of it there so that it can come to an end.
Tidy up before you leave, i.e., release all handles of external resources, e.g., files, sockets, pipes.
Somewhere in these handles might be the reason for the behavior you see.
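A minimal sketch of that pattern (the SENTINEL value is made up; the parent puts one sentinel on the queue per worker when it wants them to stop):

import multiprocessing as mp

SENTINEL = None  # agreed-upon shutdown marker

def worker(queue):
    while True:
        item = queue.get()
        if item is SENTINEL:
            break              # parent asked us to stop: leave the loop cleanly
        # ... handle the item / log record here ...
    # Tidy up before leaving: release files, sockets, pipes, etc.
    queue.close()

if __name__ == '__main__':
    queue = mp.Queue()
    p = mp.Process(target=worker, args=(queue,))
    p.start()
    queue.put('some work')
    queue.put(SENTINEL)        # ask the worker to finish
    p.join()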

Parent Thread exiting before Child Threads [python]

I'm using Python in a webapp (CGI for testing, FastCGI for production) that needs to send an occasional email (when a user registers or something else important happens). Since communicating with an SMTP server takes a long time, I'd like to spawn a thread for the mail function so that the rest of the app can finish up the request without waiting for the email to finish sending.
I tried using thread.start_new(func, (args)), but the parent returns and exits before the sending is complete, thereby killing the sending thread before it does anything useful. Is there any way to keep the process alive long enough for the child thread to finish?
Take a look at the thread.join() method. Basically it will block your calling thread until the child thread has returned (thus preventing it from exiting before it should).
Update:
To avoid making your main thread unresponsive to new requests you can use a while loop.
while threading.active_count() > 1:  # active_count() includes the main thread
    # ... look for new requests to handle ...
    time.sleep(0.1)
    # or try joining your threads with a timeout
    # for thread in my_threads:
    #     thread.join(0.1)
Update 2:
It also looks like thread.start_new(func, args) is obsolete; it was updated to thread.start_new_thread(function, args[, kwargs]). You can also create threads with the higher-level threading package (this is the package that provides the active_count() used in the previous code block):
import threading
my_thread = threading.Thread(target=func, args=(), kwargs={})
my_thread.daemon = True
my_thread.start()
You might want to use threading.enumerate if you have multiple workers and want to see which one(s) are still running.
Another alternative is to use threading.Event: the main thread sets the event and starts the worker thread off. The worker thread clears the event when it finishes its work, and the main thread checks whether the event is set or cleared to figure out whether it can exit.
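A small sketch of that Event-based handshake (names are illustrative):

import threading
import time

busy = threading.Event()

def send_mail_worker():
    # ... the slow SMTP conversation would go here ...
    time.sleep(2)      # placeholder for the real work
    busy.clear()       # tell the main thread the work is done

busy.set()             # mark the worker as busy before starting it
threading.Thread(target=send_mail_worker).start()

# The main thread can keep serving requests and may exit once the event is cleared
while busy.is_set():
    time.sleep(0.1)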
