Script hanging on time.sleep() - python

I have a script started with nohup python3 script.py & . It looks something like this:
import datetime
import logging
import time

import thing
import anotherthing

logfile = "logfile {}".format(datetime.datetime.today())
# (logging configuration elided in the question; assumed to write to logfile)
logging.basicConfig(filename=logfile, level=logging.DEBUG)

while True:
    try:
        logging.debug("Started loop.")
        do_some_stuff()
        logging.debug("Stuff was done.")
    except Exception as e:
        logging.exception("message")
    logging.debug("Starting sleep.")
    time.sleep(60)
This works fine; however, after about two days it seems to hang on time.sleep() (it just stops doing anything, without the process being killed). According to the logs, all parts of the script execute fine, but it always stops at the sleep and never starts back up. I checked for memory leaks, I/O hang-ups and connection timeouts, and none of those seem to be the case.
What could be the cause of that behavior and why?
EDIT: Added logging to pinpoint the cause. Logs always finish on DEBUG Starting Sleep.
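One way to see where the interpreter is actually stuck is the standard faulthandler module (Python 3.3+), which can dump every thread's stack either on demand via a signal or automatically after a timeout. A minimal sketch; the file name, signal and timeout are illustrative, not taken from the question:

import faulthandler
import signal

# Dump every thread's stack to this file when the process receives SIGUSR1,
# e.g. `kill -USR1 <pid>` while the script appears hung (Unix only).
trace_file = open("faulthandler_dump.log", "w")
faulthandler.register(signal.SIGUSR1, file=trace_file)

# Or dump automatically when nothing finishes within the timeout; repeat=True
# re-arms the timer so a hang later on is still caught.
faulthandler.dump_traceback_later(timeout=120, repeat=True, file=trace_file)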

Related

Popen.wait never returning with docker-compose

I am developing a wrapper around docker compose with python.
However, I struggle with Popen.
Here is how I launch it:
import signal
import subprocess as sp

argList = ['docker-compose', 'up']
env = {'HOME': '/home/me/somewhere'}
p = sp.Popen(argList, env=env)

def handler(signum, frame):
    p.send_signal(signum)

for s in (signal.SIGINT,):
    signal.signal(s, handler)  # to redirect Ctrl+C

p.wait()
Everything works fine: when I hit Ctrl+C, docker-compose kills the container gracefully, however p.wait() never returns...
Any hint?
NOTE: While writing the question, I thought I should check whether p.wait() actually returns and whether the hang comes after it (it's the last instruction in the script). Adding a print after it ends with the process exiting normally; any further hints on this behavior?
When I run your code as written, it works as intended in that it causes docker-compose to exit and then p.wait() returns. However, I occasionally see this behavior:
Killing example_service_1 ... done
ERROR: 2
I think that your code may end up delivering SIGINT twice to docker-compose. That is, I think docker-compose receives an initial SIGINT when you type CTRL-C, because it has the same controlling terminal as your Python script, and then you explicitly deliver another SIGINT in your handler function.
I don't always see this behavior, so it's possible my explanation is incorrect.
In any case, I think the correct solution here is simply to ignore SIGINT in your Python code:
import signal
import subprocess
argList = ["docker-compose", "up"]
p = subprocess.Popen(argList)
signal.signal(signal.SIGINT, signal.SIG_IGN)  # ignore Ctrl+C in the parent; docker-compose still receives it
p.wait()
With this implementation, your Python code ignores the SIGINT generated by CTRL-C, but it is received and processed normally by docker-compose.
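If you do want to keep forwarding signals yourself (say, for signals other than SIGINT), a variant is to start docker-compose in its own session so it no longer receives the terminal's Ctrl+C directly, and then deliver the signal exactly once from the handler. A rough sketch under that assumption, not tested against every docker-compose version:

import signal
import subprocess

argList = ["docker-compose", "up"]

# start_new_session=True detaches the child from the terminal's foreground
# process group, so Ctrl+C is no longer delivered to it directly.
p = subprocess.Popen(argList, start_new_session=True)

def handler(signum, frame):
    p.send_signal(signum)  # forward the signal exactly once

signal.signal(signal.SIGINT, handler)
p.wait()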

How to still execute finally block when the code is stopped through Task Scheduler?

I have this code example:
import time
from datetime import datetime

def log_info(message: str):
    with open('somefile.log', 'a') as file:
        file.write(f'{datetime.now()}: {message}\n')

try:
    log_info('Process started')
    time.sleep(1000)  # to simulate long running...
finally:
    log_info('Process ended')
When I run the code in PyCharm (even in debug mode with breakpoints) or just in a console/terminal and stop it after some time, the message "Process ended" is still written to the file. This behavior is correct.
However, if I create a task in Windows Task Scheduler, run the task and then stop it (through the Task Scheduler), the "Process ended" message is not logged.
How to fix it?
When you stop your program via Task Scheduler, the process is terminated abruptly, so the interpreter never gets the chance to reach the finally block. A finally block only runs when the interpreter itself keeps running long enough to execute it; it cannot protect you against the process being killed from outside.
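If the stop ever reaches the process as a catchable signal (Task Scheduler's End usually does not; it terminates the process outright, which nothing in Python can intercept), you can at least translate that signal into a normal SystemExit so the finally block runs. A hedged sketch along those lines, reusing the log_info helper from the question:

import signal
import sys
import time
from datetime import datetime

def log_info(message: str):
    with open('somefile.log', 'a') as file:
        file.write(f'{datetime.now()}: {message}\n')

def graceful_exit(signum, frame):
    sys.exit(0)  # raises SystemExit, so the finally block below still runs

# Only helps for "soft" stops delivered as signals; a hard TerminateProcess
# (what ending the task usually amounts to) cannot be caught.
signal.signal(signal.SIGTERM, graceful_exit)
if hasattr(signal, 'SIGBREAK'):  # Windows-only signal
    signal.signal(signal.SIGBREAK, graceful_exit)

try:
    log_info('Process started')
    time.sleep(1000)  # to simulate long running...
finally:
    log_info('Process ended')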

python script see traceback when running as background

I have a python script running like this on my server:
python script.py &
The script works fine, but I'm constantly adding new things to it and re-running it. Some days it runs for days without any problem, but sometimes it stops running (it's not running out of memory). Since I started the script in the background, I have no idea how to check for the exception or error that caused it to stop. I'm on an Ubuntu server box running in Amazon. Any advice on how to approach this inconvenience?
I use something like this. It will dump the exception which caused termination to your syslog, which you can see by examining /var/log/syslog after your script has stopped.
import traceback
import syslog

def syslog_trace(trace):
    '''Log a python stack trace to syslog'''
    log_lines = trace.split('\n')
    for line in log_lines:
        if len(line):
            syslog.syslog(line)

def main():
    # Your actual program here
    pass

if __name__ == '__main__':
    try:
        main()
    except:
        syslog_trace(traceback.format_exc())
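The same idea also works with the standard logging module instead of calling syslog directly; a minimal sketch, assuming a Linux-style /dev/log socket (adjust the address for other systems):

import logging
import logging.handlers

logger = logging.getLogger("script")
logger.setLevel(logging.DEBUG)
# /dev/log is the usual syslog socket on Linux; other systems differ.
logger.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))

def main():
    pass  # your actual program here

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # logger.exception logs the message plus the full traceback.
        logger.exception("script terminated with an unhandled exception")
        raise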

python-daemon context fails to start when a stale PID file is present

I'm using python-daemon, and having the problem that when I kill -9 a process, it leaves a pidfile behind (ok) and the next time I run my program it doesn't work unless I have already removed the pidfile by hand (not ok).
I catch all exceptions in order that context.close() is called before terminating -- when this happens (e.g. on a kill) the /var/run/mydaemon.pid* files are removed and a subsequent daemon run succeeds. However, when using SIGKILL (kill -9), I don't have the chance to call context.close(), and the /var/run files remain. In this instance, the next time I run my program it does not start successfully -- the original process returns, but the daemonized process blocks at context.open().
It seems like python-daemon ought to be noticing that there is a pidfile for a process that no longer exists, and clearing it out, but that isn't happening. Am I supposed to be doing this by hand?
Note: I'm not using with because this code runs on Python 2.4
from daemon import DaemonContext
from daemon.pidlockfile import PIDLockFile

context = DaemonContext(pidfile=PIDLockFile("/var/run/mydaemon.pid"))
context.open()
try:
    retry_main_loop()
except Exception, e:
    pass
context.close()
If you are running Linux, and process-level locks are acceptable, read on.
We try to acquire the lock. If that fails, check whether the lock is held by a running process; if not, break the lock and continue.
import os

from lockfile.pidlockfile import PIDLockFile
from lockfile import AlreadyLocked

pidfile = PIDLockFile("/var/run/mydaemon.pid", timeout=-1)
try:
    pidfile.acquire()
except AlreadyLocked:
    try:
        os.kill(pidfile.read_pid(), 0)
        print 'Process already running!'
        exit(1)
    except OSError:  # no process with the locked PID
        pidfile.break_lock()
# pidfile can now be used to create a DaemonContext
Edit: Looks like PIDLockFile is available only on lockfile >= 0.9
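If it helps, a rough continuation of that idea, handing the same pidfile object to the daemon context (untested; lockfile and python-daemon behave slightly differently across versions):

from daemon import DaemonContext

# If acquire() above succeeded we are still holding the lock; release it so
# the daemon context can take it cleanly itself when it opens.
if pidfile.is_locked() and pidfile.i_am_locking():
    pidfile.release()

context = DaemonContext(pidfile=pidfile)
context.open()
try:
    retry_main_loop()
finally:
    context.close()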
With the script provided here, the pid file remains on kill -9 as you say, but the script also cleans up properly on a restart.

Daemon dies unexpectedly

I have a python script, which I daemonise using this code
def daemonise():
    import os
    from os import fork, setsid, umask, dup2
    from sys import stdin, stdout, stderr
    if fork(): exit(0)
    umask(0)
    setsid()
    if fork(): exit(0)
    stdout.flush()
    stderr.flush()
    si = file('/dev/null', 'r')
    so = file('daemon-%s.out' % os.getpid(), 'a+')
    se = file('daemon-%s.err' % os.getpid(), 'a+')
    dup2(si.fileno(), stdin.fileno())
    dup2(so.fileno(), stdout.fileno())
    dup2(se.fileno(), stderr.fileno())
    print 'this file has the output from daemon%s' % os.getpid()
    print >> stderr, 'this file has the errors from daemon%s' % os.getpid()
The script then runs in a loop like this:

while True:
    try:
        funny_code()
        sleep(10)
    except:
        pass

It runs fine for a few hours and then dies unexpectedly. How do I go about debugging such demons, err daemons?
[Edit]
Without starting a process like monit, is there a way to write a watchdog in Python which can watch my other daemons and restart them when they go down? (Who watches the watchdog?)
You really should use python-daemon for this; it's a library that implements PEP 3143, the standard daemon process library. That way you can be sure your application does all the right things for whichever flavour of UNIX it is running under. No need to reinvent the wheel.
Why are you silently swallowing all exceptions? Try to see what exceptions are being caught by this:
while True:
    try:
        funny_code()
        sleep(10)
    except BaseException, e:
        print e.__class__, e.message
        pass
Something unexpected might be happening which is causing it to fail, but you'll never know if you blindly ignore all the exceptions.
I recommend using supervisord (written in Python, very easy to use) for daemonizing and monitoring processes. Running under supervisord you would not have to use your daemonise function.
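For illustration, a minimal program section in supervisord.conf might look something like this (the program name and paths are placeholders):

[program:myscript]
command=python /path/to/script.py
directory=/path/to
autostart=true
autorestart=true
stdout_logfile=/var/log/myscript.out.log
stderr_logfile=/var/log/myscript.err.log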
What I've used for my clients is daemontools. It is a proven, well-tested tool to run anything daemonized.
You just write your application without any daemonization, so it runs in the foreground; then create a daemontools service folder for it, and daemontools will discover it and automatically restart your application from then on, and every time the system restarts.
It can also handle log rotation and stuff. Saves a lot of tedious, repeated work.
