What is a good way to reduce the number of workers on a machine in Python-RQ?
According to the documentation, I need to send a SIGINT or SIGTERM signal to one of the worker processes on the machine:
Taking down workers
If, at any time, the worker receives SIGINT (via Ctrl+C) or SIGTERM (via kill), the worker will wait until the currently running task is finished, stop the work loop and gracefully register its own death.
If, during this takedown phase, SIGINT or SIGTERM is received again, the worker will forcefully terminate the child process (sending it SIGKILL), but will still try to register its own death.
This seems to imply a lot of coding overhead:
Would need to keep track of the PID for the worker process
Would need to have a way to send a SIGINT command from a remote machine
Do I really need to custom build this, or is there a way to do this easily using the Python-RQ library or some other existing library?
Get all running workers using rq.Worker.all()
Select the worker you want to kill
Use os.kill(worker.pid, signal.SIGINT), as in the sketch below
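Putting those steps together, a minimal sketch, assuming a default local Redis instance and that this script runs on the same machine as the worker (os.kill only signals local PIDs):

import os
import signal

from redis import Redis
from rq import Worker

redis_conn = Redis()

# List all workers registered in Redis and pick the one to stop
workers = Worker.all(connection=redis_conn)
if workers:
    worker = workers[0]  # select whichever worker you want to take down
    # SIGINT requests a warm shutdown: the current job finishes first
    os.kill(worker.pid, signal.SIGINT)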
So I have a Flask application which I've been running on Flask's built-in server, and I am ready to move it to production. This application manages several child processes. Up to this point, I've been handling graceful shutdown using signals. In particular, one shutdown mode I've used is to have a SIGHUP sent to the Flask server cause the application to propagate that signal to its children (so they can gracefully shut down), and then let the application process shut down.
In production, we're planning on using mod_wsgi. I've read that WSGI applications really shouldn't be handling signals.
So my question is, how should I achieve the following behavior with this setup?
When apache receives SIGTERM, it notifies the wsgi daemons before terminating them
The wsgi daemons are given a chance to do some cleanup on their own before shutting down
Send SIGTERM to the Apache parent process; that is roughly what happens now.
What happens is that when the Apache parent process receives SIGTERM, it in turn sends SIGTERM to all its child worker processes, as well as to the managed mod_wsgi daemon processes if using daemon mode. Those sub processes will stop accepting new requests and will be given up to 3 seconds to complete existing requests before they are forcibly shut down.
So the default behaviour of SIGTERM is to allow a bit of time to complete requests, but long running requests will not be allowed to hold up the complete server shutdown. How long it waits for the sub processes to shut down is not configurable; it is fixed at 3 seconds.
Instead of SIGTERM, you can send a SIGWINCH signal. This will cause Apache to do a graceful stop, but this has issues.
What happens in the case of SIGWINCH is that Apache will again send SIGTERM to its child worker processes, but instead of forcibly killing off the processes after 3 seconds, it will allow them to run until at least any active requests have completed.
A problem with this is that there is no failsafe. If those requests never finish, there is no timeout that I know of which will see the child worker processes forcibly shut down. As a result, your server could end up hanging on shutdown.
A second issue is that Apache will still forcibly kill off the managed mod_wsgi daemon processes after 3 seconds, and there isn't (or wasn't the last time I looked) a way to override how Apache manages those processes to enable a more graceful shutdown of the managed daemon processes. So the graceful stop signal doesn't change anything when using daemon mode.
The closest you can get to a graceful stop is to divert new traffic away from the Apache instance in a front end routing layer. Then, through some mechanism, trigger a script on the host running Apache which sends SIGUSR2 to the mod_wsgi daemon processes. Presuming you have set the graceful-timeout option on the daemon process group to some adequate failsafe, this will result in the daemon processes exiting once all active requests finish. If the timeout expires, the process goes into its normal shutdown sequence: it stops accepting new requests from the Apache child worker processes, and after the shutdown-timeout (default 5 seconds) fires, if requests are still not complete, the process is forcibly shut down.
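As an illustration, a hypothetical sketch of the signalling step; the pgrep pattern 'wsgi:myapp' is an assumption that relies on display-name having been set on the WSGIDaemonProcess directive, so adapt it to however you identify the daemon processes in your setup:

import os
import signal
import subprocess

# Hypothetical: find the mod_wsgi daemon processes by their display name
# and send each one SIGUSR2 to trigger a graceful restart.
output = subprocess.check_output(['pgrep', '-f', 'wsgi:myapp']).decode()
for pid in output.split():
    os.kill(int(pid), signal.SIGUSR2)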
In this case, it isn't actually shutting down the processes but causing them to exit, which will result in them being replaced, since we aren't telling the whole of Apache to stop, just telling the mod_wsgi daemon processes to do a graceful restart. In this situation, unless you monitor the set of daemon processes and know when they have all restarted, you don't have a clear indication that they are all done, and so can't tell when to shut down the whole Apache instance.
So it is a little bit fiddly to do, and it is hard for any server to handle this in a nice generic way, as what is appropriate really depends on the hosted application and its requirements.
The question is whether you really need to go to these lengths. Requests will inevitably fail anyway and users have to deal with that, so the interruption of a handful of requests on a restart is often not a big deal. What is so special about the application that you need to set a higher bar and attempt to ensure that zero requests are interrupted?
Since mod_wsgi 4.8.0 you can also do the following:
import mod_wsgi

def shutdown_handler(event, **kwargs):
    # do whatever you want on shutdown here
    pass

mod_wsgi.subscribe_shutdown(shutdown_handler)
I have a simple server:
from multiprocessing import Pool, TimeoutError
import time
import os
if __name__ == '__main__':
    # start worker processes
    pool = Pool(processes=1)

    while True:
        # evaluate "os.getpid()" asynchronously
        res = pool.apply_async(os.getpid, ())  # runs in *only* one process
        try:
            print(res.get(timeout=1))  # prints the PID of that process
        except TimeoutError:
            print('worker timed out')
        time.sleep(5)

    pool.close()
    print("Now the pool is closed and no longer available")
    pool.join()
    print("Done")
If I run this I get something like:
47292
47292
Then I kill 47292 while the server is running. A new worker process is started but the output of the server is:
47292
47292
worker timed out
worker timed out
worker timed out
The pool is still trying to send requests to the old worker process.
I've done some work with catching signals in both the server and the workers, and I can get slightly better behaviour, but the server still seems to be waiting for dead children on shutdown (i.e. pool.join() never ends) after a worker is killed.
What is the proper way to handle workers dying?
Graceful shutdown of workers from a server process only seems to work if none of the workers has died.
(On Python 3.4.4 but happy to upgrade if that would help.)
UPDATE:
Interestingly, this worker timeout problem does NOT happen if the pool is created with processes=2 and you kill one worker process, wait a few seconds and kill the other one. However, if you kill both worker processes in rapid succession then the "worker timed out" problem manifests itself again.
Perhaps related is that when the problem occurs, killing the server process will leave the worker processes running.
This behavior comes from the design of multiprocessing.Pool. When you kill a worker, you might kill the one holding the call_queue rlock. When that process is killed while holding the lock, no other process will ever be able to read from the call_queue again, breaking the Pool, as it can no longer communicate with its workers.
So there is actually no way to kill a worker and be sure that your Pool will still be okay afterward, because you might end up in a deadlock.
multiprocessing.Pool does not handle workers dying. You can try using concurrent.futures.ProcessPoolExecutor instead (with a slightly different API), which handles the failure of a process by default. When a process dies in a ProcessPoolExecutor, the entire executor is shut down and you get back a BrokenProcessPool error.
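A minimal sketch of that behaviour; the victim function is just an illustration that kills its own worker process to simulate the failure:

import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def victim():
    # simulate a worker dying mid-task
    os.kill(os.getpid(), signal.SIGKILL)

if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=1)
    future = executor.submit(victim)
    try:
        future.result()
    except BrokenProcessPool:
        # the executor detects the dead worker instead of hanging
        print('pool is broken; no more tasks can be submitted')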
Note that there are other deadlocks in this implementation that should be fixed in loky (DISCLAIMER: I am a maintainer of that library). Also, loky lets you resize an existing executor using a ReusablePoolExecutor and the method _resize. Let me know if you are interested; I can help you get started with this package. (I realize we still need a bit of work on the documentation... 0_0)
If I hook up a callback to the celery task_success signal handler, which process does it get executed in? The child or the worker process?
The documentation does not explicitly say. (It says so for the task_sent signal, but not for the other signals: http://docs.celeryproject.org/en/latest/userguide/signals.html#task-sent)
thanks...
There's no such thing as a "child" process; there is the process sending the task (which can be any Python process, including a celery worker, or celery beat, or anything else) and there is the worker that processes the task.
All the task signals except task_sent are executed in the worker that processes the task; in fact, they can't possibly execute anywhere else. Celery signals (like Django signals) are not like operating system events, or like Celery tasks, which can originate in one process and trigger something in another process; they are processed in the same process in which they originate. They have nothing to do with the Python standard library's signal module.
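For example, a minimal sketch of a handler connected to task_success; it will only ever run inside the worker process that executed the task:

from celery.signals import task_success

@task_success.connect
def on_task_success(sender=None, result=None, **kwargs):
    # executes in the worker that ran the task, never in the sender
    print('task %r succeeded with result %r' % (sender, result))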
I have a multiprocessing application where the parent process creates a queue and passes it to the worker processes. All processes use this queue to create a QueueHandler for logging purposes. There is a worker process reading from this queue and doing the logging.
The worker processes continuously check whether the parent is alive. The problem is that when I kill the parent process from the command line, all workers are killed except for one. The logger process also terminates. I don't know why one process keeps executing. Is it because of locks etc. in the queue? How do I properly exit in this scenario? I am using
sys.exit(0)
for exiting.
I would use sys.exit(0) only if there is no other choice. It's always better to cleanly finish each thread / process. You will have some while loop in your Process, so just break out of it there so that it can come to an end.
Tidy up before you leave, i.e., release all handles of external resources, e.g., files, sockets, pipes.
Somewhere in these handles might be the reason for the behavior you see.
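A minimal sketch of that pattern, assuming the workers detect the parent's death via os.getppid() (a re-parented child sees a new parent PID); adapt the check to however your workers do it:

import os
import time
from multiprocessing import Process

def worker():
    parent = os.getppid()
    log_file = open('worker.log', 'a')  # an example external resource
    while True:
        if os.getppid() != parent:
            # parent died and we were re-parented: leave the loop
            break
        time.sleep(1)  # ... real work goes here ...
    log_file.close()  # tidy up before the process ends

if __name__ == '__main__':
    Process(target=worker).start()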
I wonder if I stop the twistd process using
kill `cat twistd.pid`
What will happen if some SQL execution is committing at exactly that moment?
Will it wait for the execution to finish? Or is it just unknown: it could complete, or be abandoned?
I know that if I put the execution in the stopFactory method, the factory will do things like waiting for the execution to finish. But if I don't, i.e. the execution is outside the stopFactory method, will it wait for the execution to finish before the factory stops?
Thanks.
kill sends SIGTERM by default. Twisted installs a SIGTERM handler which calls reactor.stop(). Anything that would happen when you call reactor.stop() will happen when you use that kill command.
More specifically, any shutdown triggers will run. This means any services attached to an Application will have their stopService method called (and if a Deferred is returned, it will be allowed to finish before shutdown proceeds). It also means worker threads in the reactor threadpool will be shut down in an orderly manner, i.e. allowed to complete whatever job they have in progress.
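A minimal sketch of registering such a trigger (the cleanup function is just a placeholder):

from twisted.internet import reactor

def cleanup():
    # runs when the reactor begins shutting down, e.g. after SIGTERM
    print('finishing up before the reactor stops')

reactor.addSystemEventTrigger('before', 'shutdown', cleanup)
reactor.run()  # stop it with kill (SIGTERM) or Ctrl+C to see the trigger fire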
If you're using adbapi, then the ConnectionPool uses its own ThreadPool and also registers a shutdown trigger to shut that pool down in a similar orderly manner.
So, when you use kill to stop a Twisted-based process, any SQL mid-execution will be allowed to complete before shutdown takes place.