Perhaps this is a broad question, but I haven't found an answer elsewhere, so here goes.
The Python script I'm writing needs to run constantly (in a perfect world, I recognize this may not be exactly possible) on a deployed device. I've already dedicated time to adding "try...except" statements throughout so that, should an issue arise, the script will recover and continue to work.
The issue is that I'm not sure I can (nor should) handle every single possible exception that may be thrown. As such, I've decided it may be better to allow the script to die and to use systemd to restart it.
The three options:
Making no attempt to handle any exception, and just allowing systemd to restart it whenever it dies.
Meticulously creating handlers for every possible exception to guarantee that, short of loss of power, interpreter bug, or heat death of the universe, the script will always run.
A mix of the two -- making an effort to prevent crashes in some cases while allowing them in others and letting systemd restart the script.
The third choice seems the most reasonable to me. So the question is this: What factors should be considered when optimizing between "crash-proof" code and allowing a crash and restart by systemd?
For some more application-specific information: there is a small but noticeable overhead involved in starting the script, the main portion will run between 50 and 100 times per second, it is not "mission critical" in that there will be no death or damage in the event of failure (just some data loss), and I already expect intermittent issues with the network it will be on.
All known exceptional cases should be handled. Any undefined behavior is a potential security issue.
As you suggest, it is also prudent to plan for unknown exceptions. Perhaps there is a small memory leak that will eventually crash the application even when it is otherwise behaving correctly. So it is still prudent to have systemd automatically restart the script if it fails, even when all expected failure modes have been handled.
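To make that concrete, here is a minimal sketch of the combined approach (all names, the 0.01-second sleep, and the choice of OSError as the "expected" exception are illustrative, not from the question): handle the failures you understand inside the loop, let anything else escape, log it once at the top level, and exit with a non-zero status so a systemd unit configured with Restart=on-failure (or Restart=always) brings the script back up.

import logging
import sys
import time

logging.basicConfig(filename="device_script.log", level=logging.INFO)

def do_one_iteration():
    # Placeholder for the real 50-100 Hz workload.
    pass

def main():
    while True:
        try:
            do_one_iteration()
        except OSError as exc:
            # An anticipated, recoverable failure (e.g. the flaky network):
            # log it and keep running.
            logging.warning("recoverable error: %s", exc)
        # Anything unanticipated propagates out of main().
        time.sleep(0.01)   # roughly 100 iterations per second in this sketch

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # Unknown failure: record it and die with a non-zero status so that
        # systemd (Restart=on-failure) restarts the whole script.
        logging.exception("unhandled error; exiting so systemd can restart us")
        sys.exit(1)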
I've got a web.py server that has a problem: it's supposed to take input from the user and process it. Some of this processing takes place in a .pyd that I didn't write and I can't debug or rebuild, and it has some bugs. There are certain inputs that can cause it to throw an Access Violation, which for some reason cannot be caught and causes python.exe to crash, bringing down the server.
This is unacceptable, but also unavoidable, so I need to change the rules a little.
What I'd like to do is to move the unstable functionality, the interface to which is already cleanly contained in a single .py file, out of the server and have the server launch it as a separate process. If it works successfully, it should print the output, which (conveniently enough) is a text string. If not, the server should detect that the separate process crashed, and return HTTP 500. Unfortunately, I don't know how to do multi-process work in Python.
How would I implement this?
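Not having seen the exact interface, here is a rough sketch of the separate-process approach, assuming the wrapper around the .pyd can be run as a standalone script (called unstable_worker.py below, a made-up name) that reads its input on stdin and prints the result. If python.exe dies on the access violation, the parent simply sees a non-zero return code and can respond with HTTP 500.

import subprocess
import sys

def run_unstable(input_text):
    # unstable_worker.py is a hypothetical standalone wrapper around the buggy .pyd.
    proc = subprocess.Popen(
        [sys.executable, "unstable_worker.py"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    out, _err = proc.communicate(input_text)
    # A crash (access violation) shows up as a non-zero return code.
    return (proc.returncode == 0), out

# In the web.py handler, roughly:
#     ok, result = run_unstable(user_input)
#     if not ok:
#         raise web.internalerror()    # respond with HTTP 500
#     return result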
A common debugging technique for performance problems during development (i.e. not in a production system) is to randomly halt the program and inspect the stack. If you see the same or similar stacks often, that is a clue that performance problems are caused by whatever is going on when the program is doing that.
I would like to do something similar in a production system, since there is information there that could be useful in fixing performance problems, if it could be non-destructively measured.
My first thought was to send a UNIX signal to the processes in question. A signal handler for that signal would then log the stack (logging from a signal handler presents its own complications, but they could be worked around and are not really relevant to this question).
However, that is problematic because signals may cause system calls to be interrupted, which might break otherwise working production code, either by changing its behaviour or by degrading performance through retries. I believe the Python standard library still does not handle EINTR correctly in many places, which aggravates this problem.
Is there a way to avoid these problems, partially or fully? The answer need not involve signals (but I'm not sure there is another way). Note that part of the appeal of triggering the stack logging with a signal is that that should help reveal problems that occur as a result of I/O, rather than only problems that result from excessive use of CPU time.
This seems to work:
import signal
import traceback
def handler(*_):
    # Print the stack of the interrupted frame to stderr.
    traceback.print_stack()
signal.signal(signal.SIGUSR1, handler)
Python already takes care to run signal handlers safely (handlers registered through the signal module are deferred until the interpreter is between bytecode instructions), so I think this should be OK. It worked in my brief testing on Python 2.7.6.
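As an aside, assuming you can use Python 3.3 or newer and are not on Windows, the standard library's faulthandler module does essentially the same thing, and it writes the traceback directly to a file descriptor from C, which sidesteps most of the usual logging-from-a-signal-handler complications:

import faulthandler
import signal

# Dump the tracebacks of all threads to stderr whenever SIGUSR1 arrives.
faulthandler.register(signal.SIGUSR1, all_threads=True)

You can then trigger a dump at any time with kill -USR1 <pid>.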
My script accepts arbitrary-length and -content strings of Python code, then runs them inside exec() statements. If the time to run the arbitrary code passes over some predetermined limit, then the exec() statement needs to exit and a boolean flag needs to be set to indicate that a premature exit has occurred.
How can this be accomplished?
Additional information
These pieces of code will be running in parallel in numerous threads (or at least as parallel as you can get with the GIL).
If there is an alternative method in another language, I am willing to try it out.
I plan on cleaning the code to prevent access to anything that might accidentally damage my system (file and system access, import statements, nested calls to exec() or eval(), etc.).
Options I've considered
Since the exec() statements are running in threads, use a poison pill to kill the thread. Unfortunately, I've read that poison pills do not work for all cases.
Running the exec() statements inside processes, then using process.terminate() to kill everything (a minimal sketch of this appears after this list). But I'm running on Windows and I've read that process creation can be expensive. It also complicates communication with the code that's managing all of this.
Allowing only pre-written functions inside the exec() statements and having those functions periodically check for an exit flag then perform clean-up as necessary. This is complicated, time-consuming, and there are too many corner-cases to consider; I am looking for a simpler solution.
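For reference, a minimal sketch of the second option above (all names are illustrative). On Windows, multiprocessing additionally requires the target function to be importable and the main module to be guarded by if __name__ == '__main__':

import multiprocessing

def run_snippet(snippet):
    # Runs in the child process; a fresh globals dict keeps the snippet isolated.
    exec(snippet, {})

def run_with_timeout(snippet, seconds):
    proc = multiprocessing.Process(target=run_snippet, args=(snippet,))
    proc.start()
    proc.join(seconds)
    timed_out = proc.is_alive()
    if timed_out:
        proc.terminate()   # hard kill: the snippet gets no chance to clean up
        proc.join()
    return timed_out       # True -> set the premature-exit flag

if __name__ == "__main__":
    print(run_with_timeout("x = sum(range(1000))", 2))   # False
    print(run_with_timeout("while True: pass", 2))       # True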
I know this is a bit of an oddball question that deserves a "Why would you ever want to allow arbitrary code to run in an exec() statement?" type of response. I'm trying my hand at a bit of self-evolving code. This is my major stumbling block at the moment: if you allow your code to do almost anything, then it can potentially hang forever. How do you regain control and stop it when it does?
This isn't a very detailed answer, but it's more than I wanted to put into a comment.
You may want to consider something like this other question for creating functions with timeouts, using multiprocessing as a start.
The problem with threads is that you probably can't use your poison-pill approach: the threads aren't workers pulling many small tasks off a queue; each one sits blocked inside a single statement and would never get around to reading the value that tells it to exit.
You mentioned that your concern about using processes on Windows is that they are expensive. So what you might do is create your own kind of process pool (a list of processes). They are all pulling from a queue, and you submit new tasks to the queue. If any process exceeds the timeout, you kill it, and replace it in the pool with a new one. That way you limit the overhead of creating new processes only to when they are timing out, instead of creating a new one for every task.
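A rough sketch of that pool idea using multiprocessing, with each slot owning its own task queue so the manager knows exactly which process to kill and replace on timeout. All names are illustrative, and details such as queue state after a terminate() are glossed over:

import multiprocessing
import time

def worker(tasks, done):
    # Pull snippets forever; report back so the manager knows the slot is free.
    while True:
        snippet = tasks.get()
        try:
            exec(snippet, {})
        except Exception:
            pass
        done.put(True)

class WorkerSlot(object):
    def __init__(self):
        self.tasks = multiprocessing.Queue()
        self.done = multiprocessing.Queue()
        self.started_at = None
        self._spawn()

    def _spawn(self):
        self.proc = multiprocessing.Process(target=worker,
                                            args=(self.tasks, self.done))
        self.proc.daemon = True
        self.proc.start()

    def submit(self, snippet):
        self.started_at = time.time()
        self.tasks.put(snippet)

    def poll(self, timeout):
        # Returns 'idle', 'running', 'done', or 'killed'.
        if self.started_at is None:
            return 'idle'
        if not self.done.empty():
            self.done.get()
            self.started_at = None
            return 'done'
        if time.time() - self.started_at > timeout:
            self.proc.terminate()   # kill only the stuck worker...
            self._spawn()           # ...and replace it, keeping the pool size
            self.started_at = None
            return 'killed'
        return 'running'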
There are a few different options here.
First, start with jdi's suggestion of using multiprocessing. It may be that Windows process creation isn't actually expensive enough to break your use case.
If it actually is a problem, what I'd personally do is use Virtual PC, or even User Mode Linux, to just run the same code in another OS, where process creation is cheap. You get a free sandbox out of that, as well.
If you don't want to do that, jdi's suggestion of a process pool is a bit more work, but should work well as long as you don't have to kill processes very often.
If you really do want everything to be threads, you can do so, as long as you can restrict the way the jobs are written. If the jobs can always be cleanly unwound, you can kill them just by raising an exception. Of course they also have to not catch the specific exception you choose to raise. Obviously neither of these conditions is realistic as a general-purpose solution, but for your use case, it may be fine. The key is to make sure your code evolver never inserts any manual resource-management statements (like opening and closing a file); only with statements. (Alternatively, insert the open and close, but inside a try/finally.) And that's probably a good idea even if you're not doing things this way, because spinning off hundreds of processes that, e.g., each leak as many file handles as they can until they either time out or hit the file limit would slow your machine to a crawl.
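If you do go down the kill-a-thread-by-raising-an-exception road, the usual CPython-specific mechanism is PyThreadState_SetAsyncExc called through ctypes. Be aware that the exception is only delivered while the target thread is executing Python bytecode, so it will not interrupt a blocking C call. A sketch, with illustrative names:

import ctypes

class JobTimeout(Exception):
    """Raised asynchronously inside a worker thread to ask it to unwind."""

def async_raise(thread, exc_type=JobTimeout):
    # CPython only: schedule exc_type to be raised in the given thread.
    tid = ctypes.c_long(thread.ident)
    res = ctypes.pythonapi.PyThreadState_SetAsyncExc(tid, ctypes.py_object(exc_type))
    if res == 0:
        raise ValueError("invalid thread id")
    if res > 1:
        # We somehow affected more than one thread state; undo and bail out.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(tid, None)
        raise SystemError("PyThreadState_SetAsyncExc failed")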
If you can restrict the code generator/evolver even further, you could use some form of cooperative threading (e.g., greenlets), which makes things even nicer.
Finally, you could switch from CPython to a different Python implementation that can run multiple interpreter instances in a single process. I don't know whether Jython or IronPython can do so. PyPy can, and it also has a restricted-environment sandbox, but unfortunately I think both of those (and Python 3.x support) are not-ready-for-prime-time features, which means you either have to get a special build of PyPy (probably without the JIT optimizer) or build it yourself. This might be the best long-term solution, but it's probably not what you want today.
I've recently started experimenting with using Python for web development. So far I've had some success using Apache with mod_wsgi and the Django web framework for Python 2.7. However I have run into some issues with having processes constantly running, updating information and such.
I have written a script I call "daemonManager.py" that can start and stop all or individual python update loops (Should I call them Daemons?). It does that by forking, then loading the module for the specific functions it should run and starting an infinite loop. It saves a PID file in /var/run to keep track of the process. So far so good. The problems I've encountered are:
Now and then one of the processes will just quit. I check ps in the morning and the process is simply gone. No errors were logged (I'm using the logging module), and I'm covering every exception I can think of and logging them. Also, I don't think these quitting processes have anything to do with my code, because my processes all run completely different code and exit at pretty similar intervals. I could be wrong, of course. Is it normal for Python processes to just die after they've run for days/weeks? How should I tackle this problem? Should I write another daemon that periodically checks whether the other daemons are still running? What if that daemon stops? I'm at a loss on how to handle this.
How can I programmatically know if a process is still running or not? I'm saving the PID files in /var/run and checking if the PID file is there to determine whether or not the process is running. But if the process just dies of unexpected causes, the PID file will remain. I therefore have to delete these files every time a process crashes (a couple of times per week), which sort of defeats the purpose. I guess I could check if a process is running at the PID in the file, but what if another process has started and was assigned the PID of the dead process? My daemon would think that the process is running fine even if it's long dead. Again I'm at a loss just how to deal with this.
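For the narrow question of checking whether the PID from the file is still alive, the conventional Unix trick is to send signal 0, which performs the existence and permission checks without delivering anything. Note that it cannot detect PID reuse by an unrelated process; for that you would also have to compare something like the command line recorded under /proc. A sketch:

import errno
import os

def pid_running(pid):
    # Signal 0 delivers nothing; it only checks that the process exists
    # and that we are allowed to signal it.
    try:
        os.kill(pid, 0)
    except OSError as err:
        if err.errno == errno.ESRCH:   # no such process
            return False
        if err.errno == errno.EPERM:   # exists, but owned by someone else
            return True
        raise
    return True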
I will accept any useful answer on how best to run long-lived Python processes, ideally one that also sheds some light on the problems above.
I'm using Apache 2.2.14 on an Ubuntu machine.
My Python version is 2.7.2
I'll open by stating that this is just one way to manage a long-running process (LRP), not the de facto approach by any stretch.
In my experience, the best possible product comes from concentrating on the specific problem you're dealing with, while delegating supporting tech to other libraries. In this case, I'm referring to the act of backgrounding processes (the art of the double fork), monitoring, and log redirection.
My favorite solution is http://supervisord.org/
Using a system like supervisord, you basically write a conventional python script that performs a task while stuck in an "infinite" loop.
#!/usr/bin/python

import sys
import time

def main_loop():
    while 1:
        # do your stuff...
        time.sleep(0.1)

if __name__ == '__main__':
    try:
        main_loop()
    except KeyboardInterrupt:
        print >> sys.stderr, '\nExiting by user request.\n'
        sys.exit(0)
Writing your script this way makes it simple and convenient to develop and debug (you can easily start/stop it in a terminal, watching the log output as events unfold). When it comes time to throw into production, you simply define a supervisor config that calls your script (here's the full example for defining a "program", much of which is optional: http://supervisord.org/configuration.html#program-x-section-example).
Supervisor has a bunch of configuration options so I won't enumerate them, but I will say that it specifically solves the problems you describe:
Backgrounding/Daemonizing
PID tracking (can be configured to restart a process should it terminate unexpectedly)
Log redirection: log normally in your script (use a stream handler if you're using the logging module, rather than printing) and let supervisor redirect the output to a file for you. For example:
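A minimal sketch of the logging side of that last point: the script simply logs to stdout, and supervisord's stdout_logfile/stderr_logfile settings take care of writing and rotating the file.

import logging
import sys

# Log to stdout and let supervisord capture and rotate the output.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("worker started")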
You should consider Python processes as able to run "forever" assuming you don't have any memory leaks in your program, the Python interpreter, or any of the Python libraries / modules that you are using. (Even in the face of memory leaks, you might be able to run forever if you have sufficient swap space on a 64-bit machine. Decades, if not centuries, should be doable. I've had Python processes survive just fine for nearly two years on limited hardware -- before the hardware needed to be moved.)
Ensuring programs restart when they die used to be very simple back when Linux distributions used SysV-style init -- you just add a new line to the /etc/inittab and init(8) would spawn your program at boot and re-spawn it if it dies. (I know of no mechanism to replicate this functionality with the new upstart init-replacement that many distributions are using these days. I'm not saying it is impossible, I just don't know how to do it.)
But even the init(8) mechanism of years gone by wasn't as flexible as some would have liked. The daemontools package by DJB is one example of process control-and-monitoring tools intended to keep daemons living forever. The Linux-HA suite provides another similar tool, though it might provide too much "extra" functionality to be justified for this task. monit is another option.
I assume you are running Unix/Linux, but you don't really say. I have no direct advice on your issue, so I don't expect this to be the "right" answer to this question. But there is something to explore here.
First, if your daemons are crashing, you should fix that. Only programs with bugs should crash. Perhaps you should launch them under a debugger and see what happens when they crash (if that's possible). Do you have any trace logging in these processes? If not, add some. That might help diagnose the crashes.
Second, are your daemons providing services (opening pipes and waiting for requests) or are they performing periodic cleanup? If they are periodic cleanup processes, you should use cron to launch them periodically rather than have them run in an infinite loop; cron jobs should be preferred over daemon processes. Similarly, if they are services that open ports and handle requests, have you considered making them work with inetd? Again, a single daemon (inetd) should be preferred to a bunch of daemon processes.
Third, saving a PID in a file is not very effective, as you've discovered. Perhaps a shared IPC mechanism, such as a semaphore, would work better. I don't have any details here, though.
Fourth, sometimes I need something to run in the context of the website. I use a cron job that calls wget with a maintenance URL. You set a special cookie and include the cookie information on the wget command line; if the special cookie isn't present, the site returns 403 rather than running the maintenance task. The other benefit is that database logins and other environmental setup are taken care of for you, since the same code that serves normal web pages also runs the maintenance process.
Hope that gives you ideas. I think avoiding daemons if you can is the best place to start. If you can run your python within mod_wsgi that saves you having to support multiple "environments". Debugging a process that fails after running for days at a time is just brutal.
Let's say there is a server on the internet that one can send a piece of code to for evaluation. At some point the server takes all the code that has been submitted and starts running and evaluating it. However, at some point it will definitely bump into "os.system('rm -rf *')" sent by some evil programmer. Apart from "rm -rf", you could expect people to try using the server to send spam or DoS someone, or to fool around with "while True: pass" kinds of things.
Is there a way to cope with such unfriendly/untrusted code? In particular, I'm interested in a solution for Python. However, if you have info for any other language, please share.
If you are not tied to the CPython implementation, you should consider looking at PyPy for these purposes; that Python implementation allows transparent code sandboxing.
Otherwise, you can provide fake __builtin__ and __builtins__ in the corresponding globals/locals arguments to exec or eval (a minimal sketch of this follows at the end of this answer).
Moreover, you can provide a dictionary-like object instead of a real dictionary and trace what the untrusted code does with its namespace.
Moreover, you can actually trace that code (calling sys.settrace() inside the restricted environment before any other code is executed) so you can break execution if something goes wrong.
If none of these solutions is acceptable, use OS-level sandboxing, like chroot and unionfs, together with the standard multiprocessing Python module to spawn the code worker in a separate, secured process.
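As a minimal sketch of the fake-__builtins__ approach mentioned above (the names exposed below are arbitrary examples): on CPython this only removes the obvious footguns; it is not a watertight security boundary, as the other answers explain.

# Expose only a small, hand-picked set of builtins to the untrusted code.
SAFE_BUILTINS = {
    "len": len,
    "range": range,
    "min": min,
    "max": max,
    "abs": abs,
}

def run_restricted(code_text):
    env = {"__builtins__": SAFE_BUILTINS}
    exec(code_text, env, env)

run_restricted("x = max(1, 2)")          # allowed
# run_restricted("open('/etc/passwd')")  # would raise NameError: name 'open' is not defined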
You can check out pysandbox, which does just that, though the VM route is probably safer if you can afford it.
It's impossible to provide an absolute solution for this because the definition of 'bad' is pretty hard to nail down.
Is opening and writing to a file bad or good? What if that file is /dev/ram?
You can profile signatures of behavior, or you can try to block anything that might be bad, but you'll never win. JavaScript is a pretty good example of this: people run arbitrary JavaScript code all the time on their computers, and although it's supposed to be sandboxed, all sorts of security problems and edge conditions crop up.
I'm not saying don't try, you'll learn a lot from the process.
Many companies have spent millions (Intel just spent billions on McAfee) trying to understand how to detect "bad code" -- and every day machines running McAfee anti-virus get infected with viruses. Python code isn't any less dangerous than C. You can run system calls, bind to C libraries, etc.
I would seriously consider virtualizing the environment to run this stuff, so that exploits in whatever mechanism you implement can be firewalled one more time by the configuration of the virtual machine.
The number of users and the kind of code you expect to test/run will have considerable influence on your choices, by the way. If the code isn't expected to touch files or databases, or run computationally intensive tasks, and you have very little traffic, you could be almost fine just by preventing file access entirely and imposing a time limit on the process before it gets killed and the submission flagged as too expensive or malicious.
If the code you're supposed to test might be any arbitrary Django extension or page, then you're in for a lot of work probably.
You can try a generic sandbox such as Sydbox or Gentoo's sandbox. They are not Python-specific.
Both can be configured to restrict read/write to some directories. Sydbox can even sandbox sockets.
I think a fix like this is going to be really hard, and it reminds me of a lecture I attended about the benefits of programming in a virtual environment.
If you're running the code in a virtual machine, it's fine if they bugger it up. It won't solve a "while True: pass", but "rm -rf /" won't matter.
Unless I'm mistaken (and I very well might be), this is much of the reason behind the way Google changed Python for the App Engine. You run Python code on their server, but they've removed the ability to write to files. All data is saved in the "nosql" database.
It's not a direct answer to your question, but an example of how this problem has been dealt with in some circumstances.