Best practice: Monitor processes


I was wondering what the best practice solution would be to constantly monitor and restart processes, because there are multiple ways of doing it.
Additional info:
I have a unix program which uses multiple processes to work. There's a main process, it always starts first and is not likely to die or terminate without stopping the program.
Then I spawn multiple "module" processes, which take care of some work and communicate through the main process. Those modules sometimes die because of exceptions, and because it's an external program I can't resolve the issues, so I have to restart them if they die.
I've made a program to check if any of the modules died and restart them, but I need to run it manually. My program checks if the pid files of the modules exist and if they listen on a specific tcp port. If the pid file doesn't exist or the socket can't establish connection, it restarts the module.
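For illustration, the two checks described above (pid file present, TCP port accepting connections) could look roughly like this. It is only a sketch: the function names, the pid file path and the host/port are made-up placeholders, not taken from the actual program.

```python
import os
import socket


def pid_file_exists(pid_path):
    """Return True if the module's pid file is present."""
    return os.path.isfile(pid_path)


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable: all count as "not listening".
        return False


def module_is_healthy(pid_path, host, port):
    """A module counts as healthy only if both checks pass."""
    return pid_file_exists(pid_path) and port_accepts_connections(host, port)
```

A module would then be restarted whenever module_is_healthy(...) returns False.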
My thoughts so far:
Cron job to run the checks every minute and restart any dead modules. (kind of an overkill, because they don't die that frequently)
Daemon running in the background, which starts the modules and receives notifications if they die, so it doesn't have to check them constantly. (SIGCHLD signal, os.wait)
If I use the daemon method, how should I communicate with the daemon through my interface? (socket, or maybe a file which gets read if the daemon receives a specific signal)
Usually I would just go with the daemon because it seems to be the best practice method to restart the modules asap(cron only runs once a minute), but I've wanted to get some opinions from more experienced users. (I've never done something like this before, and asking doesn't hurt anyone :D)
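A minimal sketch of the daemon idea, assuming the supervisor itself starts the modules so that they are its children. os.wait() blocks until any child exits, so nothing is polled. The names start and supervise, and the commands mapping, are hypothetical; a real daemon would also need logging and a way to shut down.

```python
import os
import time


def start(argv):
    """Launch argv without waiting for it and return the child's pid."""
    return os.spawnvp(os.P_NOWAIT, argv[0], argv)


def supervise(commands, max_restarts=None, backoff=1.0):
    """Start every command and restart any that exits.

    commands: dict mapping a module name to its argv list.
    max_restarts: total number of restarts allowed (None means no limit).
    Returns the number of restarts performed once no children are left.
    """
    children = {start(argv): name for name, argv in commands.items()}
    restarts = 0
    while children:
        pid, _status = os.wait()      # blocks until any child exits
        name = children.pop(pid, None)
        if name is None:
            continue                  # a child we did not start ourselves
        if max_restarts is not None and restarts >= max_restarts:
            continue                  # limit reached: let the rest drain
        time.sleep(backoff)           # avoid a tight crash loop
        children[start(commands[name])] = name
        restarts += 1
    return restarts
```

With max_restarts=None this runs until it is killed, which is what you would want for the daemon case.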
I apologize if these questions are answered somewhere else, but I couldn't find any related question.
P.S. If I forgot something or you need more infos, please feel free to ask. :)

I would investigate running the monitoring process as part of a dedicated monitoring framework. Monit is one example, however there are of course others.
This has the advantage of providing additional features which might be useful, such as email alerts and analytics. In my experience, you should be able to use your existing program without too much modification, and Monit itself uses few system resources if that is a concern.

Related

Can I allow my server process to restart without killing existing connections?

In an attempt to make my terminal based program survive longer I was told to look into forking the process off of system. I can't find much specifying a PID to which I want to spawn a new process off of.
is this possible in Linux? I am a Windows guy mainly.
My program is going to be dealing with sockets and if my application crashed then I would lose lots of information. I was under the impression that if it was forked from system the sockets would stay alive?
EDIT: Here is what I am trying to do. I have multiple computers that I want to communicate with. So I am building a program that lets me listen on a socket(simple). Then I will connect to it from each of my remote computers(simple).
Once I have a connection I want to open a new terminal, and use my program to start interacting with the remote computer(simple).
The questions came from this portion.. The client shell will send all traffic to the main shell who will then send it out to the remote computer. When a response is received it goes to main shell and forwards it to client shell.
The issue is keeping each client shell in the loop. I want all client shells to know who is connected to who on each client shell. So client shell 1 should tell me if I have a client shell 2, 3, 4, 5, etc and who is connected to it. This jumped into sharing resources between different processes. So I was thinking about using local sockets to send data between all these client shells. But then I ran into a problem if the main shell were to die, everything is lost. So I wanted a way to try and secure it.
If that makes sense.

So, you want to be able to reload a program without losing your open socket connections?
The first thing to understand is that when a process exits, all open file descriptors are closed. This includes socket connections. Running as a daemon does not change that. A process becomes a daemon by becoming independent of your terminal session, so that it will continue to run when your terminal session ends. But, like any other process, when a daemon terminates for any reason (normal exit, crashed, killed, machine is restarted, etc), then all connections to it cease to exist. BTW this is not specific to unix, Windows is the same.
So, the short answer to your question is NO, there's no way to tell unix/linux to not close your sockets when your process stops, it will close them and that's that.
The long answer is, there are a few ways to re-engineer things to get around this:
1) You can have your program exec() itself when you send it a special message or signal (eg SIGHUP). In unix, exec (or its several variants), does not end or start any process, it simply loads code into the current process and starts execution. The new code takes the place of the old within the same process. Since the process remains the same, any open files remain open. However you will lose any data that you had in memory, so the sockets will be open, but your program will know nothing about them. On startup you'd have to use various system calls to discover which descriptors are open in your process and whether any of them are socket connections to clients. One way to get around this would be to pass critical information as command line arguments or environment variables which can be passed through the exec() call and thus preserved for use of the new code when it starts executing.
Keep in mind that this only works when the process calls exec ITSELF while it is still running. So you cannot recover from a crash or any other cause of your process ending.. your connections will be gone. But this method does solve the problem of you wanting to load new code without losing your connections.
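Here is a rough Python sketch of option 1, passing the descriptor numbers through an environment variable. The variable name INHERITED_FDS and the helper names are my own inventions. Note that Python marks descriptors as non-inheritable by default, so they have to be made inheritable before the exec.

```python
import os
import socket
import sys

FD_ENV_VAR = "INHERITED_FDS"  # hypothetical variable name


def encode_fds(fds):
    """Turn a list of descriptor numbers into a string for the environment."""
    return ",".join(str(fd) for fd in fds)


def decode_fds(environ):
    """Read the descriptor numbers back; empty list on a normal first start."""
    raw = environ.get(FD_ENV_VAR, "")
    return [int(part) for part in raw.split(",") if part]


def adopt_sockets(environ):
    """Rebuild socket objects around the descriptors that survived the exec."""
    return [socket.socket(fileno=fd) for fd in decode_fds(environ)]


def reexec(fds):
    """Replace the running code with a fresh copy, keeping fds open."""
    for fd in fds:
        os.set_inheritable(fd, True)  # otherwise they are closed on exec
    env = dict(os.environ)
    env[FD_ENV_VAR] = encode_fds(fds)
    # Same process, same pid, new code; this call does not return.
    os.execve(sys.executable, [sys.executable] + sys.argv, env)
```

The program would call adopt_sockets(os.environ) at startup and reexec([s.fileno() for s in open_sockets]) when it receives SIGHUP. Everything else that was in memory still has to be rebuilt.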
2) You can bypass the issue by dividing your server (master) into two processes. The first (call it the "proxy") accepts the TCP connections from the clients and keeps them open. The proxy can never exit, so it should be kept so simple that you'll rarely want to change that code. The second process runs the "worker", which is the code that implements your application logic. All the code you might want to change often should go in the worker. Now all you need to do is establish interprocess communication from the proxy to the worker, and make sure that if the worker exits, there's enough information in the proxy to re-establish your application state when the worker starts up again. In a really simple, low volume application, the mechanism can be as simple as the proxy doing a fork() + exec() of the worker each time it needs to do something. A fancier way to do this, which I have used with good results, is a unix domain datagram (SOCK_DGRAM) socket. The proxy receives messages from the clients, forwards them to the worker through the datagram socket, the worker does the work, and responds with the result back to the proxy, which in turn forwards it back to the client. This works well because as long as the proxy is running and has opened the unix domain socket, the worker can restart at will. Shared memory can also work as a way to communicate between proxy and worker.
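To make the datagram variant of option 2 a bit more concrete, here is a small sketch in Python. The socket paths and function names are placeholders. Both sides bind to a filesystem path, because a unix datagram socket can only be replied to if the sender has an address.

```python
import os
import socket


def open_dgram_socket(path):
    """Create a unix datagram socket bound to path (removing a stale file)."""
    if os.path.exists(path):
        os.unlink(path)
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    return sock


def proxy_forward(proxy_sock, worker_path, request, timeout=5.0):
    """Send one client request to the worker and return the worker's reply."""
    proxy_sock.settimeout(timeout)
    proxy_sock.sendto(request, worker_path)
    reply, _sender = proxy_sock.recvfrom(65536)
    return reply


def worker_handle_one(worker_sock, handler):
    """Receive one request, run the application logic, reply to the sender."""
    request, sender = worker_sock.recvfrom(65536)
    worker_sock.sendto(handler(request), sender)
```

If the worker is restarted it simply binds to the same path again, and the proxy keeps sending to that path without noticing anything beyond a possible timeout in between.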
3) You can use a unix domain socket with the sendmsg() and recvmsg() functions and an SCM_RIGHTS control message to pass not the client data itself, but the open socket file descriptors themselves from the old instance to the new. This is the only way to pass open file descriptors between unrelated processes. Using this mechanism, there are all sorts of strategies you can implement.. for example, you could start a new instance of your master program, and have it connect (via a unix domain socket) to the old instance and transfer all the sockets over. Then your old instance can exit. Or, you can use the proxy/worker model, but instead of passing messages through the proxy, you can just have the proxy hand the socket descriptor to the worker via the unix domain socket between them, and then the worker can talk directly to the client using that descriptor. Or, you could have your master send all its socket file descriptors to another "stash" process that holds on to them in case the master needs to restart. There are all sorts of architectures possible. Keep in mind that the operating system just provides the ability to ship the descriptors around; all the other logic you have to code for yourself.
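Passing descriptors with SCM_RIGHTS looks roughly like this in Python. It closely follows the pattern from the socket module documentation; the helper names send_fds and recv_fds here are just local functions, and sock must be a connected unix domain socket.

```python
import array
import socket


def send_fds(sock, fds, payload=b"x"):
    """Send open file descriptors over a unix domain socket.

    At least one byte of ordinary data has to accompany the control message.
    """
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", fds))]
    sock.sendmsg([payload], ancillary)


def recv_fds(sock, max_fds=16, bufsize=1024):
    """Receive the payload plus any descriptors that came with it."""
    fds = array.array("i")
    payload, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(max_fds * fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore a trailing partial integer, if any.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    return payload, list(fds)
```

The receiving process gets new descriptor numbers that refer to the same open connections, and can wrap them with socket.socket(fileno=fd).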
4) You can accept that no matter how careful you are, inevitably connections will be lost. Networks are unreliable, programs crash sometimes, machines are restarted. So rather than going to significant effort to make sure your connections don't close, you can instead engineer your system to recover when they inevitably do.
The simplest approach to this would be: Since your clients know who they wish to connect to, you could have your client processes run a loop where, if the connection to the master is lost for any reason, they periodically try to reconnect (let's say every 10-30 seconds), until they succeed. So all the master has to do is to open up the rendezvous (listening) socket and wait, and the connections will be re-established from every client that is still out there running. The client then has to re-send any information it has which is necessary to re-establish proper state in the master.
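The client-side reconnect loop can be very small. A sketch, with the host, port and delay values as placeholders:

```python
import socket
import time


def connect_with_retry(host, port, delay=10.0, max_attempts=None):
    """Keep trying to reach the master until it answers.

    Returns a connected socket, or None if max_attempts was given and
    every attempt failed. With max_attempts=None it retries forever.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return socket.create_connection((host, port), timeout=5.0)
        except OSError:
            if max_attempts is not None and attempt >= max_attempts:
                return None
            time.sleep(delay)  # wait before the next attempt
```

After a successful reconnect the client would re-send whatever the master needs to rebuild its state, for example the client's name.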
The list of connected computers can be kept in the memory of the master, there is no reason to write it to disk or anywhere else, since when the master exits (for any reason), those connections don't exist anymore. Any client can then connect to your server (master) process and ask it for a list of clients that are connected.
Personally, I would take this last approach. Since it seems that in your system, the connections themselves are much more valuable than the state of the master, being able to recover them in the event of a loss would be the first priority.
In any case, since it seems that the role of the master is to simply pass data back and forth among clients, this would be a good application of "asynchronous" socket I/O using the select() or poll() functions, this allows you to communicate between multiple sockets in one process without blocking. Here's a good example of a poll() based server that accepts multiple connections:
https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_71/rzab6/poll.htm
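If the master ends up being written in Python, the same idea is available through the standard selectors module, which uses poll()/epoll() underneath where available. This is only a sketch of a relay that forwards each message to every other connected client; the class name Relay is made up, and sendall() on a non-blocking socket can fail when the peer is slow, which real code would need to handle.

```python
import selectors
import socket


class Relay:
    """Forward every message from one client to all other connected clients."""

    def __init__(self, host="127.0.0.1", port=0):
        self.sel = selectors.DefaultSelector()
        self.listener = socket.create_server((host, port))
        self.listener.setblocking(False)
        self.sel.register(self.listener, selectors.EVENT_READ)
        self.clients = set()

    @property
    def address(self):
        return self.listener.getsockname()

    def step(self, timeout=None):
        """Handle whichever sockets are ready right now."""
        for key, _events in self.sel.select(timeout):
            if key.fileobj is self.listener:
                conn, _addr = self.listener.accept()
                conn.setblocking(False)
                self.sel.register(conn, selectors.EVENT_READ)
                self.clients.add(conn)
            else:
                self._relay(key.fileobj)

    def _relay(self, conn):
        try:
            data = conn.recv(4096)
        except ConnectionError:
            data = b""
        if not data:  # the peer closed the connection
            self.sel.unregister(conn)
            self.clients.discard(conn)
            conn.close()
            return
        for other in self.clients - {conn}:
            other.sendall(data)
```

The main program would just call relay.step() in a loop. The set relay.clients is also the in-memory list of who is connected.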
As far as running your process "off System".. in Unix/Linux this is referred to running as a daemon. In *ix, these processes are children of process id 1, the init process.. which is the first process that starts when the system starts. You can't tell your process to become a child of init, this happens automatically when the existing parent exits. All "orphaned" processes are adopted by init. Since there are many easily found examples of writing a unix daemon (at this point the code you need to write to do this has become pretty standardized), I won't paste any code here, but here's one good example I found: http://web.archive.org/web/20060603181849/http://www.linuxprofilm.com/articles/linux-daemon-howto.html#ss4.1
If your linux distribution uses systemd (a recent replacement for init in some distributions), then you can do it as a systemd service, which is systemd's idea of a daemon but they do some of the work for you (for better or for worse.. there's a lot of complaints about systemd.. wars have been fought over it)...

Forking from your own program is one approach; however, a much simpler and easier one is to create a service. A service is a little wrapper around your program that deals with keeping it running, restarting it if it fails and providing ways to start and stop it.
This link shows you how to write a service. Although it's specifically for a web server application, the same logic can be applied to anything.
https://medium.com/#benmorel/creating-a-linux-service-with-systemd-611b5c8b91d6
Then to start the program you would write:
sudo systemctl start my_service_name
To stop it:
sudo systemctl stop my_service_name
To view its outputs:
sudo journalctl -u my_service_name

Spawn a subprocess but kill it if main process gets killed

I am creating a program in Python that listens to various user interactions and logs them. I have these requirements/restrictions:
I need a separate process that sends those logs to a remote database every hour
I can't do it in the current process because it blocks the UI.
If the main process stops, the background process should also stop.
I've been reading about subprocess but I can't seem to find anything on how to stop both simultaneously. I need the equivalent of spawn_link if anybody knows some Erlang/Elixir.
Thanks!

To answer the question in the title (for visitors from google): there are robust solutions on Linux, Windows using OS-specific APIs and less robust but more portable psutil-based solutions.
To fix your specific problem (it is an XY problem): use a daemon thread instead of a process.
A thread lets you perform I/O without blocking the GUI, even if the GUI toolkit you've chosen doesn't provide an async. I/O API such as tkinter's createfilehandler() or gtk's io_add_watch().
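A minimal sketch of the daemon-thread approach. The upload callable and the interval are placeholders for whatever sends the logs to the remote database. Because the thread is a daemon thread, it does not keep the process alive once the main thread exits.

```python
import threading


def start_background_uploader(upload, interval_seconds, stop_event=None):
    """Call upload() every interval_seconds on a daemon thread.

    Returns (thread, stop_event); set the event to stop the loop early.
    """
    stop_event = stop_event or threading.Event()

    def run():
        # Event.wait returns False on timeout and True once the event is set.
        while not stop_event.wait(interval_seconds):
            upload()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread, stop_event
```

For the hourly upload this would be start_background_uploader(send_logs, 3600), where send_logs is your own function.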

How to detect unresponsive/frozen processes?

I have several scripts that I use to do some web crawling. They are always running, and should never stop. However, after about a week, they systematically "freeze": there is no output anymore, no response to Ctrl+C or anything. The only way is to kill the process and restart it.
I suspect that these issues come from the library I use for retrieving the data (urllib2), but the issue is very hard to reproduce.
I am thus wondering how I could check the state of the process and kill/restart it automatically if it is frozen. I was thinking of creating a PID file, and update it regularly. Another script could then periodically check the last modification date of this PID file, and restart the process if it's too old. I could use something like Monit to do the monitoring.
Is this how I should do it? Is there another best practice/common way for checking the responsiveness of a process?
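The heartbeat idea described above could be sketched like this. The file path and the age threshold are placeholders; the crawler calls touch_heartbeat() after each unit of work, and the watchdog script calls heartbeat_is_stale().

```python
import os
import time


def touch_heartbeat(path):
    """Called regularly by the crawler to show it is still making progress."""
    with open(path, "a"):
        os.utime(path, None)  # set the modification time to now


def heartbeat_is_stale(path, max_age_seconds, now=None):
    """Return True if the heartbeat is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) > max_age_seconds
    except OSError:
        return True  # no file at all: treat the process as dead
```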

If you have a process that is always running, has no connected terminal, and is the process group leader - that is a daemon. You undoubtedly know all that.
There are some de facto practices in coding programs like that. One is to have a signal handler which takes SIGHUP and forces the program to reinitialize itself. This means closing all of the open log files, rereading config scripts, etc. I do not know how applicable that is to your problem but it sometimes solves issues like frozen daemons at my work.
You can customize the idea by employing SIGUSR1 and SIGUSR2 signals to do special things, like write status to a file, or anything else. Since signals come in on an interrupt, the trap statement in scripts and signal handlers in python itself will push program state onto the interrupt stack and do "stuff".
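In Python that pattern might look like the following sketch. The handlers only record which signal arrived; the main loop then calls handle_pending() at a safe point. reload_config and write_status are placeholder callables you would supply.

```python
import signal

pending = set()  # signal numbers received but not yet acted on


def _remember(signum, frame):
    # Keep the handler tiny: just note that the signal arrived.
    pending.add(signum)


def install_handlers():
    """Route SIGHUP and SIGUSR1 to the handler (call from the main thread)."""
    for signum in (signal.SIGHUP, signal.SIGUSR1):
        signal.signal(signum, _remember)


def handle_pending(reload_config, write_status):
    """Call from the main loop to act on signals received since last time."""
    if signal.SIGHUP in pending:
        pending.discard(signal.SIGHUP)
        reload_config()   # e.g. reopen log files, reread configuration
    if signal.SIGUSR1 in pending:
        pending.discard(signal.SIGUSR1)
        write_status()    # e.g. dump current state to a file
```

Note that if the process is truly stuck inside a blocking C call, the Python-level handler will not run until that call returns.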
In your case you may want the program to fork/exec itself and then kill the parent.

How do I run long term (infinite) Python processes?

I've recently started experimenting with using Python for web development. So far I've had some success using Apache with mod_wsgi and the Django web framework for Python 2.7. However I have run into some issues with having processes constantly running, updating information and such.
I have written a script I call "daemonManager.py" that can start and stop all or individual python update loops (Should I call them Daemons?). It does that by forking, then loading the module for the specific functions it should run and starting an infinite loop. It saves a PID file in /var/run to keep track of the process. So far so good. The problems I've encountered are:
Now and then one of the processes will just quit. I check ps in the morning and the process is just gone. No errors were logged (I'm using the logging module), and I'm covering every exception I can think of and logging them. Also I don't think these quitting processes have anything to do with my code, because all my processes run completely different code and exit at pretty similar intervals. I could be wrong of course. Is it normal for Python processes to just die after they've run for days/weeks? How should I tackle this problem? Should I write another daemon that periodically checks if the other daemons are still running? What if that daemon stops? I'm at a loss on how to handle this.
How can I programmatically know if a process is still running or not? I'm saving the PID files in /var/run and checking if the PID file is there to determine whether or not the process is running. But if the process just dies of unexpected causes, the PID file will remain. I therefore have to delete these files every time a process crashes (a couple of times per week), which sort of defeats the purpose. I guess I could check if a process is running at the PID in the file, but what if another process has started and was assigned the PID of the dead process? My daemon would think that the process is running fine even if it's long dead. Again I'm at a loss just how to deal with this.
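For reference, the "check if a process is running at the PID in the file" idea is usually done by sending signal 0, which performs the existence and permission checks without delivering anything. A sketch with made-up helper names; note that it does not solve the PID reuse concern, it only tells you that some process currently has that PID.

```python
import os


def read_pid(pid_path):
    """Return the pid stored in the file, or None if it is missing or garbled."""
    try:
        with open(pid_path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def pid_is_running(pid):
    """Return True if some process currently has this pid."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # it exists but belongs to another user
    return True
```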
Any useful answer on how to best run infinite Python processes, hopefully also shedding some light on the above problems, I will accept
I'm using Apache 2.2.14 on an Ubuntu machine.
My Python version is 2.7.2

I'll open by stating that this is one way to manage a long running process (LRP) -- not de facto by any stretch.
In my experience, the best possible product comes from concentrating on the specific problem you're dealing with, while delegating supporting tech to other libraries. In this case, I'm referring to the act of backgrounding processes (the art of the double fork), monitoring, and log redirection.
My favorite solution is http://supervisord.org/
Using a system like supervisord, you basically write a conventional python script that performs a task while stuck in an "infinite" loop.
#!/usr/bin/python
import sys
import time

def main_loop():
    while True:
        # do your stuff...
        time.sleep(0.1)

if __name__ == '__main__':
    try:
        main_loop()
    except KeyboardInterrupt:
        # Ctrl+C ends the loop; report it on stderr and exit cleanly.
        sys.stderr.write('\nExiting by user request.\n')
        sys.exit(0)
Writing your script this way makes it simple and convenient to develop and debug (you can easily start/stop it in a terminal, watching the log output as events unfold). When it comes time to throw into production, you simply define a supervisor config that calls your script (here's the full example for defining a "program", much of which is optional: http://supervisord.org/configuration.html#program-x-section-example).
Supervisor has a bunch of configuration options so I won't enumerate them, but I will say that it specifically solves the problems you describe:
Backgrounding/Daemonizing
PID tracking (can be configured to restart a process should it terminate unexpectedly)
Log normally in your script (stream handler if using logging module rather than printing) but let supervisor redirect to a file for you.

You should consider Python processes as able to run "forever" assuming you don't have any memory leaks in your program, the Python interpreter, or any of the Python libraries / modules that you are using. (Even in the face of memory leaks, you might be able to run forever if you have sufficient swap space on a 64-bit machine. Decades, if not centuries, should be doable. I've had Python processes survive just fine for nearly two years on limited hardware -- before the hardware needed to be moved.)
Ensuring programs restart when they die used to be very simple back when Linux distributions used SysV-style init -- you just add a new line to the /etc/inittab and init(8) would spawn your program at boot and re-spawn it if it dies. (I know of no mechanism to replicate this functionality with the new upstart init-replacement that many distributions are using these days. I'm not saying it is impossible, I just don't know how to do it.)
But even the init(8) mechanism of years gone by wasn't as flexible as some would have liked. The daemontools package by DJB is one example of process control-and-monitoring tools intended to keep daemons living forever. The Linux-HA suite provides another similar tool, though it might provide too much "extra" functionality to be justified for this task. monit is another option.

I assume you are running Unix/Linux but you don't really say. I have no direct advice on your issue. So I don't expect to be the "right" answer to this question. But there is something to explore here.
First, if your daemons are crashing, you should fix that. Only programs with bugs should crash. Perhaps you should launch them under a debugger and see what happens when they crash (if that's possible). Do you have any trace logging in these processes? If not, add them. That might help diagnose your crash.
Second, are your daemons providing services (opening pipes and waiting for requests) or are they performing periodic cleanup? If they are periodic cleanup processes you should use cron to launch them periodically rather than have them run in an infinite loop. Cron processes should be preferred over daemon processes. Similarly, if they are services that open ports and service requests, have you considered making them work with INETD? Again, a single daemon (inetd) should be preferred to a bunch of daemon processes.
Third, saving a PID in a file is not very effective, as you've discovered. Perhaps a shared IPC, like a semaphore, would work better. I don't have any details here though.
Fourth, sometimes I need stuff to run in the context of the website. I use a cron process that calls wget with a maintenance URL. You set a special cookie and include the cookie info in the wget command line. If the special cookie doesn't exist, return 403 rather than performing the maintenance process. The other benefit here is that logging in to the database and other environmental concerns are avoided, since the same code that serves normal web pages also serves the maintenance process.
Hope that gives you ideas. I think avoiding daemons if you can is the best place to start. If you can run your python within mod_wsgi that saves you having to support multiple "environments". Debugging a process that fails after running for days at a time is just brutal.

starting my own threads within python paste

I'm writing a web application using pylons and paste. I have some work I want to do after an HTTP request is finished (send some emails, write some stuff to the db, etc) that I don't want to block the HTTP request on.
If I start a thread to do this work, is that OK? I always see this stuff about paste killing off hung threads, etc. Will it kill my threads which are doing work?
What else can I do here? Is there a way I can make the request return but have some code run after it's done?
Thanks.

You could use a thread approach (maybe setting the Thread.daemon property would help--but I'm not sure).
However, I would suggest looking into a task queuing system. You can place a task on a queue (which is very fast), then a listener can handle the tasks asynchronously, allowing the HTTP request to return quickly. There are two task queues that I know of for Django:
Django Queue Service
Celery
You could also consider using a more "enterprise" messaging solution, such as RabbitMQ or ActiveMQ.
Edit: previous answer with some good pointers.

I think the best solution is a messaging system because it can be configured to not lose the task if the pylons process goes down. I would always use processes over threads especially in this case. If you are using python 2.6+ use the built in multiprocessing or you can always install the processing module which you can find on pypi (I can't post a link because I am a new user).

Take a look at gearman, it was specifically made for farming out tasks to 'workers' to handle. They can even handle it in a different language entirely. You can come back and ask if the task was completed, or just let it complete. That should work well for many tasks.
If you absolutely need to ensure it was completed, I'd suggest queuing tasks in a database or somewhere persistent, then have a separate process that runs through it ensuring each one gets handled appropriately.

To answer your basic question directly, you should be able to use threads just as you'd like. The "killing hung threads" part is paste cleaning up its own threads, not yours.
There are other packages that might help, etc, but I'd suggest you start with simple threads and see how far you get. Only then will you know what you need next.
(Note, "Thread.daemon" should be mostly irrelevant to you here. Setting that true will ensure a thread you start will not prevent the entire process from exiting. Doing so would mean, however, that if the process exited "cleanly" (as opposed to being forced to exit) your thread would be terminated even if it wasn't done with its work. Whether that's a problem, and how you handle things like that, depend entirely on your own requirements and design.)
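As a starting point for the simple-threads route, here is a sketch of one long-lived worker thread fed from a queue, so the request handler only has to enqueue the work and return. The class name BackgroundTasks is made up and nothing here is specific to pylons or paste. It uses daemon=True, so the caveat above applies: tasks still queued when the process exits are lost.

```python
import queue
import threading


class BackgroundTasks:
    """Run submitted callables one at a time on a single worker thread."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, func, *args):
        """Queue func(*args) and return immediately."""
        self._queue.put((func, args))

    def wait(self):
        """Block until everything submitted so far has been processed."""
        self._queue.join()

    def _run(self):
        while True:
            func, args = self._queue.get()
            try:
                func(*args)
            except Exception:
                pass  # a real application would log the failure here
            finally:
                self._queue.task_done()
```

In the request handler you would call something like tasks.submit(send_email, address), where send_email is your own function, and then return the response.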
