How to detect unresponsive/frozen processes?

How to detect unresponsive/frozen processes? - python

I have several scripts that I use to do some web crawling. They are always running, and should never stop. However, after about a week, they systematically "freeze": there is no output anymore, no response to Ctrl+C or anything. The only way is to kill the process and restart it.
I suspect that these issues come from the library I use for retrieving the data (urllib2), but the issue is very hard to reproduce.
I am thus wondering how I could check the state of the process and kill/restart it automatically if it is frozen. I was thinking of creating a PID file, and update it regularly. Another script could then periodically check the last modification date of this PID file, and restart the process if it's too old. I could use something like Monit to do the monitoring.
Is this how I should do it? Is there another best practice/common way for checking the responsiveness of a process?

If you have a process that is always running, has no connected terminal, and is the process group leader - that is a daemon. You undoubtedly know all that.
There are some defacto practices in coding programs like that. One is to have a signal handler which takes SIGHUP and forces the program to reinitialize itself. This means closing all of the open log files, rereading config scripts, etc. I do not know how applicable that is to your problem but it sometimes solves issues like frozen daemons at my work.
You can customize the idea by employing SIGUSR1 and SIGUSR2 signals to do special things, like write status to a file, or anything else. Since signals come in on an interrupt, the trap statement in scripts and signal handlers in python itself will push program state onto the interrupt stack and do "stuff".
In your case you may want the program fork/exec itself and then kill the parent.

Related

How to kill sub process in anytime while closing it in orderly fashion

I have a GUI that run multiple tests files.
All the tests are python files that the GUI run one by one using subprocess running from thread.
In the GUI , the user has the ability to stop the machine that run the tests.
I already found out how to kill a subprocess for this purpose and its works.
see Killing sub process that run inside a thread
Now , I have another issue.
The test open lots of instances that I must close in orderly fashion.
When I use kill() method , all the instances are not closed and I cant run further tests.
I.E the COM port I used in the test before the stop from the user , is still occupied and prevent me to run any more tests.
The only remedy for it, will be closing the GUI and start it over.
In short ,I need a way to kill a subprocess in anytime but still close all the instances like sys.exit() will do or the same effect that closing he GUI has.
I've try to use wait(timeout) but unless I'm using it wrongly , it don't do the trick.
Can I use kind of interrupt to the test that will call some method in the test , that will close it in orderly fashion?

You should be able to use something like this solution to identify and terminate the subprocesses launched by the application to run tests. This solution seems to particularly fit your use case of killing all background processes in order to allow your application to run more tests.
Alternatively, you may be able to create a try-except wrapper or decorator for your tests that allow them to be killed when sent a certain signal is sent by the application, but this requires the application to keep track of all initiated subprocesses and may fail to properly kill all processes if the tests launch additional subprocesses in the background.

Spawn a subprocess but kill it if main process gets killed

I am creating a program in Python that listens to varios user interactions and logs them. I have these requirements/restrictions:
I need a separate process that sends those logs to a remote database every hour
I can't do it in the current process because it blocks the UI.
If the main process stops, the background process should also stop.
I've been reading about subprocess but I can't seem to find anything on how to stop both simultaneously. I need the equivalent of spawn_link if anybody know some Erlang/Elixir.
Thanks!

To answer the question in the title (for visitors from google): there are robust solutions on Linux, Windows using OS-specific APIs and less robust but more portable psutil-based solutions.
To fix your specific problem (it is XY problem): use a daemon thread instead of a process.
A thread would allow to perform I/O without blocking GUI, code example (even if GUI you've chosen doesn't provide async. I/O API such as tkinter's createfilehandler() or gtk's io_add_watch()).

Debugging a python application that just sort of "hangs"

I have an event-driven application, written in python. After a while (usually >1 week) it appears to just stop responding to events. When this happens, I just ctrl-C and re-run and all is well-again. However, it's kind of annoying that this keeps happening and I have no idea what's causing it. Is there a way I can run my application that when this occurs and the application is no longer accepting connections, I can drop into a debugger and see what it's doing and why it's not taking connections?
I've used pdb before, but the way I've used it (if condition: pdb.set_trace()) doesn't really apply here, because I have no idea what it's doing in the code when it fails. My ideal situation would be instead of Ctrl-C maybe I hit Ctrl-somethingelse and that causes it to stop and drop into the debugger. Is such a thing easily done?

Triggering pdb in your case is probably not simple. However, whenever I need to debug such hangs, I inspect a "snapshot" of tracebacks of all the threads in the process, using the dumpstacks() function.
You can either use a timer to call it periodically and print the output to a log file, and refer to it when you notice the hanging, or harness some RPC mechanism (e.g. signals) to trigger the function call in your process on demand. I usually do the latter, because the processes in my system already listen to such RPC requests (using rpyc).

Best practice: Monitor processes

I was wondering what the best practice solution would be to constantly monitor and resart processes, because there are multiple ways in doing it.
Additional info:
I have a unix program which uses multiple processes to work. There's a main process, it always starts first and is not likely to die or terminate without stopping the program.
Then I spawn multiple "module" processes, which take care of some work and communicate through the main process. Those modules sometimes die because of exceptions, and because it's an external program I can't resolve the issues, so I have to restart them if they die.
I've made a program to check if any of the modules died and restart them, but I need to run it manually. My program checks if the pid files of the modules exist and if they listen on a specific tcp port. If the pid file doesn't exist or the socket can't establish connection, it restarts the module.
My thoughts so far:
Cron job to run the checks every minute and restart any dead modules. (kind of an overkill, because they don't die that frequently)
Daemon running in the background, which starts the modules and receives notifications if they die, so it doesn't have to check them constantly. (SIGCHLD signal, os.wait)
If I use the daemon method, how should I communicate with the daemon through my interface? (socket, or maybe a file which gets read if the daemon receives a specific signal)
Usually I would just go with the daemon because it seems to be the best practice method to restart the modules asap(cron only runs once a minute), but I've wanted to get some opinions from more experienced users. (I've never done something like this before, and asking doesn't hurt anyone :D)
I apologize if these questions are answered somewhere else, but I couldn't find any related question.
P.S. If I forgot something or you need more infos, please feel free to ask. :)

I would investigate running the monitoring process as part of a dedicated monitoring framework. Monit is one example, however there are of course others.
This has the advantage of providing additional features which might be useful, such as email alerts and analytics. In my experience, you should be able to use your existing program without too much modification, and Monit itself uses few system resources if that is a concern.

How do I run long term (infinite) Python processes?

I've recently started experimenting with using Python for web development. So far I've had some success using Apache with mod_wsgi and the Django web framework for Python 2.7. However I have run into some issues with having processes constantly running, updating information and such.
I have written a script I call "daemonManager.py" that can start and stop all or individual python update loops (Should I call them Daemons?). It does that by forking, then loading the module for the specific functions it should run and starting an infinite loop. It saves a PID file in /var/run to keep track of the process. So far so good. The problems I've encountered are:
Now and then one of the processes will just quit. I check ps in the morning and the process is just gone. No errors were logged (I'm using the logging module), and I'm covering every exception I can think of and logging them. Also I don't think these quitting processes has anything to do with my code, because all my processes run completely different code and exit at pretty similar intervals. I could be wrong of course. Is it normal for Python processes to just die after they've run for days/weeks? How should I tackle this problem? Should I write another daemon that periodically checks if the other daemons are still running? What if that daemon stops? I'm at a loss on how to handle this.
How can I programmatically know if a process is still running or not? I'm saving the PID files in /var/run and checking if the PID file is there to determine whether or not the process is running. But if the process just dies of unexpected causes, the PID file will remain. I therefore have to delete these files every time a process crashes (a couple of times per week), which sort of defeats the purpose. I guess I could check if a process is running at the PID in the file, but what if another process has started and was assigned the PID of the dead process? My daemon would think that the process is running fine even if it's long dead. Again I'm at a loss just how to deal with this.
Any useful answer on how to best run infinite Python processes, hopefully also shedding some light on the above problems, I will accept
I'm using Apache 2.2.14 on an Ubuntu machine.
My Python version is 2.7.2

I'll open by stating that this is one way to manage a long running process (LRP) -- not de facto by any stretch.
In my experience, the best possible product comes from concentrating on the specific problem you're dealing with, while delegating supporting tech to other libraries. In this case, I'm referring to the act of backgrounding processes (the art of the double fork), monitoring, and log redirection.
My favorite solution is http://supervisord.org/
Using a system like supervisord, you basically write a conventional python script that performs a task while stuck in an "infinite" loop.
#!/usr/bin/python
import sys
import time
def main_loop():
while 1:
# do your stuff...
time.sleep(0.1)
if __name__ == '__main__':
try:
main_loop()
except KeyboardInterrupt:
print >> sys.stderr, '\nExiting by user request.\n'
sys.exit(0)
Writing your script this way makes it simple and convenient to develop and debug (you can easily start/stop it in a terminal, watching the log output as events unfold). When it comes time to throw into production, you simply define a supervisor config that calls your script (here's the full example for defining a "program", much of which is optional: http://supervisord.org/configuration.html#program-x-section-example).
Supervisor has a bunch of configuration options so I won't enumerate them, but I will say that it specifically solves the problems you describe:
Backgrounding/Daemonizing
PID tracking (can be configured to restart a process should it terminate unexpectedly)
Log normally in your script (stream handler if using logging module rather than printing) but let supervisor redirect to a file for you.

You should consider Python processes as able to run "forever" assuming you don't have any memory leaks in your program, the Python interpreter, or any of the Python libraries / modules that you are using. (Even in the face of memory leaks, you might be able to run forever if you have sufficient swap space on a 64-bit machine. Decades, if not centuries, should be doable. I've had Python processes survive just fine for nearly two years on limited hardware -- before the hardware needed to be moved.)
Ensuring programs restart when they die used to be very simple back when Linux distributions used SysV-style init -- you just add a new line to the /etc/inittab and init(8) would spawn your program at boot and re-spawn it if it dies. (I know of no mechanism to replicate this functionality with the new upstart init-replacement that many distributions are using these days. I'm not saying it is impossible, I just don't know how to do it.)
But even the init(8) mechanism of years gone by wasn't as flexible as some would have liked. The daemontools package by DJB is one example of process control-and-monitoring tools intended to keep daemons living forever. The Linux-HA suite provides another similar tool, though it might provide too much "extra" functionality to be justified for this task. monit is another option.

I assume you are running Unix/Linux but you don't really say. I have no direct advice on your issue. So I don't expect to be the "right" answer to this question. But there is something to explore here.
First, if your daemons are crashing, you should fix that. Only programs with bugs should crash. Perhaps you should launch them under a debugger and see what happens when they crash (if that's possible). Do you have any trace logging in these processes? If not, add them. That might help diagnose your crash.
Second, are your daemons providing services (opening pipes and waiting for requests) or are they performing periodic cleanup? If they are periodic cleanup processes you should use cron to launch them periodically rather then have them run in an infinite loop. Cron processes should be preferred over daemon processes. Similarly, if they are services that open ports and service requests, have you considered making them work with INETD? Again, a single daemon (inetd) should be preferred to a bunch of daemon processes.
Third, saving a PID in a file is not very effective, as you've discovered. Perhaps a shared IPC, like a semaphore, would work better. I don't have any details here though.
Fourth, sometimes I need stuff to run in the context of the website. I use a cron process that calls wget with a maintenance URL. You set a special cookie and include the cookie info in with wget command line. If the special cookie doesn't exist, return 403 rather than performing the maintenance process. The other benefit here is login to the database and other environmental concerns of avoided since the code that serves normal web pages are serving the maintenance process.
Hope that gives you ideas. I think avoiding daemons if you can is the best place to start. If you can run your python within mod_wsgi that saves you having to support multiple "environments". Debugging a process that fails after running for days at a time is just brutal.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.