Debugging a python application that just sort of "hangs"

Debugging a python application that just sort of "hangs" - python

I have an event-driven application, written in python. After a while (usually >1 week) it appears to just stop responding to events. When this happens, I just ctrl-C and re-run and all is well-again. However, it's kind of annoying that this keeps happening and I have no idea what's causing it. Is there a way I can run my application that when this occurs and the application is no longer accepting connections, I can drop into a debugger and see what it's doing and why it's not taking connections?
I've used pdb before, but the way I've used it (if condition: pdb.set_trace()) doesn't really apply here, because I have no idea what it's doing in the code when it fails. My ideal situation would be instead of Ctrl-C maybe I hit Ctrl-somethingelse and that causes it to stop and drop into the debugger. Is such a thing easily done?

Triggering pdb in your case is probably not simple. However, whenever I need to debug such hangs, I inspect a "snapshot" of tracebacks of all the threads in the process, using the dumpstacks() function.
You can either use a timer to call it periodically and print the output to a log file, and refer to it when you notice the hanging, or harness some RPC mechanism (e.g. signals) to trigger the function call in your process on demand. I usually do the latter, because the processes in my system already listen to such RPC requests (using rpyc).

Related

Pause Execution in Python

I am implementing a Python plugin that is part of a larger C++ program. The goal of this program is to allow the user to input a command's actions in Python. It currently receives a string from the C++ function and runs it via the exec() function. The user can then use an API to affect changes on the larger C++ program.
The current feature I am working on is a pause execution feature. It needs to remember where it is in the code execution as well as the state of any local variables, and resume execution once a condition has been met. I am not very familiar with Python, and I would like some advice how to implement this feature. My first design ideas:
1) Using the yield command.
This seemed to be a good idea at the start since when you use the next command it remembers everything I needed it to, but the problem is that yield only returns to the previous level in the call stack as far as I can tell. So if the user calls a function that yields it will simply return to the user's code, and not the larger C++ program. As far as I can tell there isn't a way to propagate the yield command up the stack???
2) Threading
Create a main python thread that creates a thread for each command. This main thread would spawn a thread for each command executed and kill it when it is done. If it needs to be suspended and restarted it could do so through a queue of locks.
Those were the only two options I came up with. I am not sure the yield function would work or is what it was designed to do. I think the Threading approach would work but might be overkill, and take a long time to develop. I also was looking for some sort of Task Module in Python, but couldn't find exactly what I was looking for. I was wondering if anyone has any other suggestions as I am not very familiar with Python.
EDIT: As mentioned in the comments I did not explain what needs to happen when the script "Pauses". The python plugin needs to allow the C++ program to continue execution. In my mind this means A) returning if we are talking about a single threaded approach, or B) Sending a message(Function call?) to C++
EDIT EDIT: As stated I didn't fully explain the problem description. I will make another post that has a better statement of what currently exists, and what needs to happen as well as providing some sudo code. I am new to Stack Overflow, so if this is not the appropriate response please let me know.

Whenever a signal is sent in Python, execution is immediately paused until whatever signal handler function is being used is finished executing; at that point, the execution continues right where it left off. My suggestion would be to use one of the user-defined signals (signal.SIGUSR1 and signal.SIGUSR2). Take a look at the signal documentation here:
https://docs.python.org/2/library/signal.html
At the beginning of the program, you'd define a signal handler function like so:
def signal_pause(signum, frame):
if signum == signal.SIGUSR1:
# Do your pause here - function processing, etc
else:
pass
Then in the main program somewhere, you'll switch out the default signal handler for the one you just created:
signal.signal(signal.SIGUSR1, signal_pause)
And finally, whenever you want to pause, you'll send the SIGUSR1 signal like so:
os.kill(os.getpid(),signal.SIGUSR1)
Your code will immediately pause, saving its state, and head to the signal_pause function to do whatever you need to do. Once that function exits, normal program execution will resume.
EDIT: this assumes you want to do something sophisticated while you're pausing the program. If all you want to do is wait a few seconds or ask for some user input, there are some much easier ways (time.sleep or input respectively).
EDIT EDIT: this assumes you're on a Unix system.

If you need to communicate with a C program, then sockets are probably the way to go.
https://docs.python.org/2/library/socket.html
One of your two programs acts as the socket server, and the other connects to it as the socket client. When you want the C++ program to continue, you use socket.send() to transmit a continue message. Then your Python program would use socket.recv(), which will cause it to wait around until it receives a message back from the C++ program.
If you need two programs to send signals to each other, this is probably the safest way to go about it.

How to detect unresponsive/frozen processes?

I have several scripts that I use to do some web crawling. They are always running, and should never stop. However, after about a week, they systematically "freeze": there is no output anymore, no response to Ctrl+C or anything. The only way is to kill the process and restart it.
I suspect that these issues come from the library I use for retrieving the data (urllib2), but the issue is very hard to reproduce.
I am thus wondering how I could check the state of the process and kill/restart it automatically if it is frozen. I was thinking of creating a PID file, and update it regularly. Another script could then periodically check the last modification date of this PID file, and restart the process if it's too old. I could use something like Monit to do the monitoring.
Is this how I should do it? Is there another best practice/common way for checking the responsiveness of a process?

If you have a process that is always running, has no connected terminal, and is the process group leader - that is a daemon. You undoubtedly know all that.
There are some defacto practices in coding programs like that. One is to have a signal handler which takes SIGHUP and forces the program to reinitialize itself. This means closing all of the open log files, rereading config scripts, etc. I do not know how applicable that is to your problem but it sometimes solves issues like frozen daemons at my work.
You can customize the idea by employing SIGUSR1 and SIGUSR2 signals to do special things, like write status to a file, or anything else. Since signals come in on an interrupt, the trap statement in scripts and signal handlers in python itself will push program state onto the interrupt stack and do "stuff".
In your case you may want the program fork/exec itself and then kill the parent.

Listening for subprocess failure in python

Using subprocess.Popen(), I'm launching a process that is supposed to take a long time. However, there is a chance that the process will fail shortly after it launches (producing a return code of 1). If that happens, I want to intercept the failure and present an explanatory message to the user. Is there a way to "listen" to the process and respond if it fails? I can't just use Popen.wait() because my python program has to keep running.
The hack I have in place right now is to time.sleep() my python program for .5 seconds (which should be enough time for the subprocess to to fail if it's going to do so). After the python program resumes, it polls the subprocess to determine if it has failed or not.
I imagine that a better solution might use threading and Popen.wait(), but I'm a relative beginner to python.
Edit:
The subprocess is a Java daemon that I'm launching. If another instance of the daemon is already running on the system, the Java subprocess will exit with a return code of 1, and I want to intercept the messy Java exception stack trace and present an understandable error message to the user.

Two approaches:
Call Popen.wait() on a thread as you suggested yourself, then call an error handler function if the exit code is non-zero. Make sure that the error handler is thread safe, preferably by dispatching the error message to the main thread if your application has an event loop.
Rewrite your application to use an event loop that already supports monitoring child processes, such as pyev. If you just want to monitor one subprocess, this is probably overkill.

How do I run long term (infinite) Python processes?

I've recently started experimenting with using Python for web development. So far I've had some success using Apache with mod_wsgi and the Django web framework for Python 2.7. However I have run into some issues with having processes constantly running, updating information and such.
I have written a script I call "daemonManager.py" that can start and stop all or individual python update loops (Should I call them Daemons?). It does that by forking, then loading the module for the specific functions it should run and starting an infinite loop. It saves a PID file in /var/run to keep track of the process. So far so good. The problems I've encountered are:
Now and then one of the processes will just quit. I check ps in the morning and the process is just gone. No errors were logged (I'm using the logging module), and I'm covering every exception I can think of and logging them. Also I don't think these quitting processes has anything to do with my code, because all my processes run completely different code and exit at pretty similar intervals. I could be wrong of course. Is it normal for Python processes to just die after they've run for days/weeks? How should I tackle this problem? Should I write another daemon that periodically checks if the other daemons are still running? What if that daemon stops? I'm at a loss on how to handle this.
How can I programmatically know if a process is still running or not? I'm saving the PID files in /var/run and checking if the PID file is there to determine whether or not the process is running. But if the process just dies of unexpected causes, the PID file will remain. I therefore have to delete these files every time a process crashes (a couple of times per week), which sort of defeats the purpose. I guess I could check if a process is running at the PID in the file, but what if another process has started and was assigned the PID of the dead process? My daemon would think that the process is running fine even if it's long dead. Again I'm at a loss just how to deal with this.
Any useful answer on how to best run infinite Python processes, hopefully also shedding some light on the above problems, I will accept
I'm using Apache 2.2.14 on an Ubuntu machine.
My Python version is 2.7.2

I'll open by stating that this is one way to manage a long running process (LRP) -- not de facto by any stretch.
In my experience, the best possible product comes from concentrating on the specific problem you're dealing with, while delegating supporting tech to other libraries. In this case, I'm referring to the act of backgrounding processes (the art of the double fork), monitoring, and log redirection.
My favorite solution is http://supervisord.org/
Using a system like supervisord, you basically write a conventional python script that performs a task while stuck in an "infinite" loop.
#!/usr/bin/python
import sys
import time
def main_loop():
while 1:
# do your stuff...
time.sleep(0.1)
if __name__ == '__main__':
try:
main_loop()
except KeyboardInterrupt:
print >> sys.stderr, '\nExiting by user request.\n'
sys.exit(0)
Writing your script this way makes it simple and convenient to develop and debug (you can easily start/stop it in a terminal, watching the log output as events unfold). When it comes time to throw into production, you simply define a supervisor config that calls your script (here's the full example for defining a "program", much of which is optional: http://supervisord.org/configuration.html#program-x-section-example).
Supervisor has a bunch of configuration options so I won't enumerate them, but I will say that it specifically solves the problems you describe:
Backgrounding/Daemonizing
PID tracking (can be configured to restart a process should it terminate unexpectedly)
Log normally in your script (stream handler if using logging module rather than printing) but let supervisor redirect to a file for you.

You should consider Python processes as able to run "forever" assuming you don't have any memory leaks in your program, the Python interpreter, or any of the Python libraries / modules that you are using. (Even in the face of memory leaks, you might be able to run forever if you have sufficient swap space on a 64-bit machine. Decades, if not centuries, should be doable. I've had Python processes survive just fine for nearly two years on limited hardware -- before the hardware needed to be moved.)
Ensuring programs restart when they die used to be very simple back when Linux distributions used SysV-style init -- you just add a new line to the /etc/inittab and init(8) would spawn your program at boot and re-spawn it if it dies. (I know of no mechanism to replicate this functionality with the new upstart init-replacement that many distributions are using these days. I'm not saying it is impossible, I just don't know how to do it.)
But even the init(8) mechanism of years gone by wasn't as flexible as some would have liked. The daemontools package by DJB is one example of process control-and-monitoring tools intended to keep daemons living forever. The Linux-HA suite provides another similar tool, though it might provide too much "extra" functionality to be justified for this task. monit is another option.

I assume you are running Unix/Linux but you don't really say. I have no direct advice on your issue. So I don't expect to be the "right" answer to this question. But there is something to explore here.
First, if your daemons are crashing, you should fix that. Only programs with bugs should crash. Perhaps you should launch them under a debugger and see what happens when they crash (if that's possible). Do you have any trace logging in these processes? If not, add them. That might help diagnose your crash.
Second, are your daemons providing services (opening pipes and waiting for requests) or are they performing periodic cleanup? If they are periodic cleanup processes you should use cron to launch them periodically rather then have them run in an infinite loop. Cron processes should be preferred over daemon processes. Similarly, if they are services that open ports and service requests, have you considered making them work with INETD? Again, a single daemon (inetd) should be preferred to a bunch of daemon processes.
Third, saving a PID in a file is not very effective, as you've discovered. Perhaps a shared IPC, like a semaphore, would work better. I don't have any details here though.
Fourth, sometimes I need stuff to run in the context of the website. I use a cron process that calls wget with a maintenance URL. You set a special cookie and include the cookie info in with wget command line. If the special cookie doesn't exist, return 403 rather than performing the maintenance process. The other benefit here is login to the database and other environmental concerns of avoided since the code that serves normal web pages are serving the maintenance process.
Hope that gives you ideas. I think avoiding daemons if you can is the best place to start. If you can run your python within mod_wsgi that saves you having to support multiple "environments". Debugging a process that fails after running for days at a time is just brutal.

How to implement pause (and more) functionality?

My apologies beforehand for the length of the question, I didn't want to leave anything out.
Some background information
I'm trying to automate a data entry process by writing a Python application that uses the Windows API to simulate keystrokes, mouse movement and window/control manipulation. I have to resort to this method because I do not (yet) have the security clearance required to access the datastore/database directly (e.g. using SQL) or indirectly through a better suited API. Bureaucracy, it's a pain ;-)
The data entry process involves the correction of sales orders due to changes in article availability. The unavailable articles are either removed from the order or replaced by another suitable article.
Initially I want a human to be able to monitor the automatic data entry process to make sure everything goes right. To achieve this I slow down the actions on the one hand but also inform the user of what is currently going on through a pinned window.
The actual question
To allow the user to halt the automation process I'm registering the Pause/Break key as a hotkey and in the handler I want to pause the automation functionality. However, I'm currently struggling to figure out a way to properly pause the execution of the automation functionality. When the pause function is invoked I want the automation process to stop dead in its tracks, no matter what it is doing. I don't want it to even execute another keystroke.
UPDATE [23/01]: I actually want to do more than just pause, I want to be able to communicate with the automation process while it is running and request it to pause, skip the current sales order, give up completely and perhaps even more.
Can anybody show me The Right Way (TM) to achieve what I want?
Some more information
Here's an example of how the automation works (I'm using the pywinauto library):
from pywinauto import application
app = application.Application()
app.start_("notepad")
app.Notepad.TypeKeys("abcdef")
UPDATE [25/01]: After a few days of working on my application I've noticed I don't really use pywinauto that much, right now I'm only using it for finding window and then I directly use SendKeysCtypes.SendKeys to simulate keyboard input and win32api functions to simulate mouse input.
What I've found out so far
Here are a few methods I've come across so far in my search for an answer:
I could separate the automation functionality and the interface + hotkey listener in two separate processes. Let's refer to the former as "automator" and the latter as "manager". The manager can then pause the execution of the automator by sending the process a SIGSTOP signal and unpause it using the SIGCONT signal (or the Windows equivalents through SuspendThread/ResumeThread).
To be able to update the user interface the automator will need to inform the manager of its progression through some sort of an IPC mechanism.
Cons:
Would using SIGSTOP not be a little harsh? Would it even work properly? Lots of people seem to be advising against it and even calling it "dangerous".
I am worried that implementing the IPC mechanism is going to be a bit complicated. On the other hand, I have worked with DBus which wouldn't be too hard to implement.
The second method and one that lots of people seem to be suggesting involves using threads and essentially boils down to the following (simplified):
while True:
if self.pause: # pause
# Do the work...
However, doing it this way it seems it will only pause after there is no more work to do. The only way I see this method would work would be to divide the work (the entire automation process) into smaller work segments (i.e. tasks). Before starting on a new task the worker thread would check if it should pause and wait.
Cons:
Seems like an implementation to divide the work into smaller segments, such as the one above, would be very ugly code wise (aesthetically).
The way I imagine it, all statements would be transformed to look something like: queue.put((function, args)) (e.g. queue.put((app.Notepad.TypeKeys, "abcdef"))) and you'd have the automating process thread running through the tasks and continuously checking for the pause state before starting a task. That just can't be right...
The program would not actually stop dead in its tracks, but would first finish a task (however small) before actually pausing.
Progress made
UPDATE [23/01]: I've implemented a version of my application using the first method through the mentioned SuspendThread/ResumeThread functionality. So far this seems to work very nicely and also allows me to write the automation stuff just like you'd write any other script. The only quirk I've come across is that keyboard modifiers (CTRL, ALT, SHIFT) get "stuck" while paused. Something I can probably easily work around.
I've also written a test using the second method (threads and signals/message passing) and implemented the pause functionality. However, it looks really ugly (both checking for the pause flag and everything related to the "doing the work"). So if anybody can show me a proper example of something similar to the second method I'd appreciate it.
Related questions
Pausing a process?
Pausing a thread using threading class
Alex Martelli posted an answer saying:
There is no method for other threads to forcibly pause a thread (any more than there is for other threads to kill that thread) -- the target thread must cooperate by occasionally checking appropriate "flags" (a threading.Condition might be appropriate for the pause/unpause case).
He then referred to the multiprocessing module and SIGSTOP/SIGCONT.
Is there a way to indefinitely pause a thread?
Pausing a process in Windows
An answer to this question quotes the MSDN documentation regarding SuspendThread:
This function is primarily designed for use by debuggers. It is not intended to be used for thread synchronization. Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread. To avoid this situation, a thread within an application that is not a debugger should signal the other thread to suspend itself. The target thread must be designed to watch for this signal and respond appropriately.
Is there any way to kill a Thread in Python?
How do I pass an exception between threads in python

Keep in mind that although in your level of abstraction, "executing a keystroke" is a single atomic operation, it's implemented on the machine as a rather complicated sequence of machine instructions. So, pausing a thread at arbitrary points could lead to things being in an indeterminate state. Sending SIGSTOP is the same level of dangerous as pausing a thread at an arbitrary point. Depending on where you are in a particular step, though, your automation could potentially be broken. For example, if you pause in the middle of a timing-dependent step.
It seems to me that this problem would be best solved at the level of the automation library. I'm not very familiar with the automation library that you're using. It might be worth contacting the developers of the library to see if they have any suggestions for pausing the execution of automation steps at safe sub-step levels.

I don't know pywinauto. But I'll assume that you have something like an Application class which you obtain and have methods like SendKeys/SendMouseEvent/etc to do things.
Create your own MyApplication class which holds a reference to pywinauto's application class. Provide the same methods but before each method check whether a pause event has occurred. If it has, you can jump into code which handles the pause event. That way you are checking for a pause every time you cause an event, but this all is handled by the one class without putting pause all over your code.
Once you've detected the pause you can handle it any way you like. For example, you can throw an exception to force giving up on the current task.

Separating the functionality and the interface thread/process is definately the best option imho, the second solution is quicker and easier but definately not better.
Perhaps using multiple threads and an exception would be a better idea than using multiple processes. But if you're using multiple processes than SIGSTOP might be your only way to get it to work.
Is there anything against using 2 threads for this?
1 thread for actually executing
1 thread for reading the user input

I use Python but not pywinauto; for this sort of tasks I use AutoHotKey . One way to implement a simple pause in an AutoHotkey script may be using a "toggle" key like ScrollLock and testing the key state in the script. Also, the script can restore the key state after switching the internal pause setting on / off.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.