How to handle job cancellation in Slurm? - python

I am using the Slurm job manager on an HPC cluster. Sometimes a job is cancelled due to the time limit, and I would like my program to finish gracefully.
As far as I understand, cancellation happens in two stages, precisely so that a software developer can finish the program gracefully:
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** JOB 18522559 ON ncm0317 CANCELLED AT 2020-12-14T19:42:43 DUE TO TIME LIMIT ***
You can see that I am given 62 seconds to finish the job the way I want it to finish (by saving some files, etc.).
Question: how do I do this? I understand that first some Unix signal is sent to my job and I need to respond to it correctly. However, I cannot find any information in the Slurm documentation about what this signal is. Besides, I do not know exactly how to handle it in Python, probably through exception handling.

In Slurm, you can decide which signal is sent at which moment before your job hits the time limit.
From the sbatch man page:
--signal=[[R][B]:]<sig_num>[@<sig_time>]
When a job is within sig_time seconds of its end time, send it the signal sig_num.
So set
#SBATCH --signal=B:TERM@300
to get Slurm to signal the job with SIGTERM 300 seconds (5 minutes) before the allocation ends; per the man page, sig_time is given in seconds. Note that depending on how you start your job, you might need to remove the B: part.
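For context, a minimal submission script using this directive might look as follows (the job name, time limit, and script name are placeholders, not from the original question):

```shell
#!/bin/bash
#SBATCH --job-name=resumable      # placeholder job name
#SBATCH --time=01:00:00           # hard time limit for the allocation
#SBATCH --signal=B:TERM@300       # send SIGTERM 300 s before the limit
# The B: prefix delivers the signal to the batch shell itself;
# without it, the signal goes to the job steps started with srun.
python myscript.py
```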
In your Python script, use the signal package. You need to define a "signal handler", a function that will be called when the signal is received, and "register" that function for a specific signal. Because that function disrupts the normal flow of the program when called, you need to keep it short and simple to avoid unwanted side effects, especially with multithreaded code.
A typical scheme in a Slurm environment is to have a script skeleton like this:
#!/usr/bin/env python
import signal, os, sys

# Global Boolean variable that indicates that a signal has been received
interrupted = False

# Global Boolean variable that indicates the natural end of the computations
converged = False

# Definition of the signal handler. All it does is flip the 'interrupted' variable
def signal_handler(signum, frame):
    global interrupted
    interrupted = True

# Register the signal handler
signal.signal(signal.SIGTERM, signal_handler)

try:
    # Try to recover a state file with the relevant variables stored
    # from a previous stop, if any
    with open('state', 'r') as file:
        vars = file.read()
except FileNotFoundError:
    # Otherwise bootstrap (start from scratch)
    vars = init_computation()

while not interrupted and not converged:
    do_computation_iteration()

# Save current state
if interrupted:
    with open('state', 'w') as file:
        file.write(vars)
    sys.exit(99)

sys.exit(0)
This first tries to resume the computations left by a previous run of the job, and otherwise bootstraps them. If the job was interrupted, it lets the current loop iteration finish properly and then saves the needed variables to disk, exiting with return code 99. If Slurm is configured for it, this allows the job to be requeued automatically for further iterations.
If Slurm is not configured for it, you can requeue manually in the submission script like this:
python myscript.py || scontrol requeue $SLURM_JOB_ID
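The `||` form above requeues on any non-zero exit. If you want to requeue only on the agreed-upon code 99 and let real failures surface, a slightly longer sketch (untested, assuming a POSIX shell inside the Slurm batch script) would be:

```shell
python myscript.py
status=$?
if [ "$status" -eq 99 ]; then
    # Interrupted by the time-limit signal: put the job back in the queue.
    scontrol requeue "$SLURM_JOB_ID"
elif [ "$status" -ne 0 ]; then
    # A genuine error: propagate it so the job is marked as failed.
    exit "$status"
fi
```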

In most programming languages, Unix signals are captured using a callback. Python is no exception. To catch Unix signals using Python, just use the signal package.
For example, to gracefully exit:
import signal, sys

def terminate_signal(signalnum, frame):
    print('Terminate the process')
    # save results, whatever...
    sys.exit()

# initialize signal with a callback
signal.signal(signal.SIGTERM, terminate_signal)

while True:
    pass  # work
See the list of possible signals in the signal module documentation. SIGTERM is the one used to "politely ask a program to terminate".
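To exercise such a handler without a cluster, you can deliver SIGTERM to the current process yourself with os.kill (a small self-contained check; the handler here records the signal instead of exiting, which is my variation, so the result can be inspected):

```python
import os
import signal

received = []

def handler(signum, frame):
    # Record the signal instead of exiting, so we can inspect it afterwards.
    received.append(signum)

signal.signal(signal.SIGTERM, handler)
os.kill(os.getpid(), signal.SIGTERM)  # deliver SIGTERM to ourselves

print(received == [signal.SIGTERM])   # the handler ran exactly once
```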

Related

How to gracefully stop a Kubernetes Watch on Services when the system exits

I have the following KOPF Daemon running:
import kopf
import kubernetes

@kopf.on.daemon(group='test.example.com', version='v1', plural='myclusters')
def worker_services(namespace, name, spec, status, stopped, logger, **kwargs):
    config = kubernetes.client.Configuration()
    client = kubernetes.client.ApiClient(config)
    workload = kubernetes.client.CoreV1Api(client)
    watch = kubernetes.watch.Watch()
    while not stopped:
        for e in watch.stream(workload.list_service_for_all_namespaces):
            svc = e['object']
            lb = helpers.get_service_loadbalancer(name, namespace, svc, logger)
            if "NodePort" in svc.spec.type:
                logger.info(f"Found Service of type NodePort: {svc.metadata.name}")
                do_some_work(svc)
    watch.stop()
When the system exits by means of Ctrl + C or Kubernetes killing the pod, I get the following Warning:
INFO:kopf.reactor.running:Signal SIGINT is received. Operator is stopping.
[2020-12-11 15:07:52,107] kopf.reactor.running [INFO ] Signal SIGINT is received. Operator is stopping.
WARNING:kopf.objects:Daemon 'worker_services' did not exit in time. Leaving it orphaned.
[2020-12-11 15:07:52,113] kopf.objects [WARNING ] Daemon 'worker_services' did not exit in time. Leaving it orphaned.
This keeps the process running in the background even if I press Ctrl + Z.
I believe the for loop is holding up the process with the stream and does not terminate when the system exits, so it never reaches the watch.stop() on the last line of this snippet.
I have tried the following thus far:
Adding a watch.stop() after the do_some_work(svc), but this sends my program into a very aggressive loop consuming up to 90% of my CPU
Putting the whole for loop on a different thread, but this made some components fail, such as the logger
Implementing yield e to make the process non-blocking, but this made the daemon complete after the first service it watched, and the watch ended
Implementing signal listeners with the signal library to listen for SIGINT and then call watch.stop() in the exit function, but the function never got called
Implementing cancellation_timeout=3.0, i.e. @kopf.on.daemon(group='test.example.com', version='v1', plural='myclusters', cancellation_timeout=3.0), together with some of the solutions above, but also with no success
Any input would be appreciated, thanks in advance.
What I see in your example is that the code tries to watch over the resources in the cluster. However, it uses the official client library, which is synchronous. Synchronous functions (or threads) cannot be interrupted in Python, unlike asynchronous ones (which also require async I/O to be used). Once the function shown here is called, it never exits and has no point where it checks the stopped flag during that long run.
What you can do with the current code, is to check for the stopped flag more often:
@kopf.on.daemon(group='test.example.com', version='v1', plural='myclusters')
def worker_services(namespace, name, spec, status, stopped, logger, **kwargs):
    …
    watch = kubernetes.watch.Watch()
    for e in watch.stream(workload.list_service_for_all_namespaces):
        if stopped:  # <<<< check inside of the for-loop
            break
        svc = …
        ………
    watch.stop()
This will check whether the daemon is stopped on every event of every service. However, it will not check the stop-flag if there is complete silence (which happens).
To work around that, you can limit the watch by time (please check the client's documentation on how this is done properly, but IIRC it is this way):
watch = kubernetes.watch.Watch()
for e in watch.stream(workload.list_service_for_all_namespaces, timeout_seconds=123):
That will limit the daemon's time of no response/cancellation to 123 seconds at most, in case no services are available in the cluster or none are changed.
In that case, you have no need to check the stopped condition outside of the for-loop: the daemon function will exit with the intention to be restarted, the stopped flag will be checked by the framework, and it will not restart the function, as intended.
On a side note, I should mention that watching for resources inside handlers might not be the best idea. Watching is complicated, way too complicated, with all the edge cases and issues it brings.
And since the framework already does the watching, it might be easier to utilise that, and implement the cross-resource connectivity via the operator's global state:
import queue
import kopf

SERVICE_QUEUES = {}  # {(mc_namespace, mc_name) -> queue.Queue}
KNOWN_SERVICES = {}  # {(svc_namespace, svc_name) -> svc_body}

@kopf.on.event('v1', 'services')
def service_is_seen(type, body, meta, event, namespace, name, **_):
    for q in SERVICE_QUEUES.values():  # right, to all MyClusters known to the moment
        q.put(event)
    if type == 'DELETED' or meta.get('deletionTimestamp'):
        if (namespace, name) in KNOWN_SERVICES:
            del KNOWN_SERVICES[(namespace, name)]
    else:
        KNOWN_SERVICES[(namespace, name)] = body

@kopf.on.daemon(group='test.example.com', version='v1', plural='myclusters')
def worker_services(namespace, name, spec, status, stopped, logger, **kwargs):
    # Start getting the updates as soon as possible, to not miss anything
    # while handling the "known" services.
    q = SERVICE_QUEUES[(namespace, name)] = queue.Queue()
    try:
        # Process the Services known before the daemon start/restart.
        for (svc_namespace, svc_name), svc in KNOWN_SERVICES.items():
            if not stopped:
                lb = helpers.get_service_loadbalancer(name, namespace, svc, logger)
                if "NodePort" in svc.spec['type']:
                    logger.info(f"Found Service of type NodePort: {svc.metadata.name}")
                    do_some_work(svc)

        # Process the Services arriving after the daemon start/restart.
        while not stopped:
            try:
                svc_event = q.get(timeout=1.0)
            except queue.Empty:
                pass
            else:
                svc = svc_event['object']
                lb = helpers.get_service_loadbalancer(name, namespace, svc, logger)
                if "NodePort" in svc.spec['type']:
                    logger.info(f"Found Service of type NodePort: {svc.metadata.name}")
                    do_some_work(svc)
    finally:
        del SERVICE_QUEUES[(namespace, name)]
It is a simplified example (but it might work "as is"; I did not check), intended only to show an idea of how to make the resources talk to each other while using the framework's capabilities.
The solution depends on the use-case, and this one might not be applicable in your intended case. Maybe I am missing something about why it is done that way. It would be good if you reported your use-case to Kopf's repo as a feature request so that it could be supported by the framework later.

Python Multiprocessing Process.start() wait for process to be started

I have some test cases where I start a webserver process and then run some URL tests to check that every function runs fine.
The server process's start-up time depends on the system it is executed on. It's a matter of seconds, and I work with a time.sleep(5) for now.
But honestly, I'm not a huge fan of sleep(), since it might work on my systems, but what if the test runs on a system where the server needs 6 seconds to start? (So it's never really safe to go that way.)
Tests will fail for no reason at all.
So the question is: is there a nice way to check whether the process has really started?
I use the python multiprocessing module
Example:
from multiprocessing import Process
import testapp.server
import requests
import testapp.config as cfg
import time
p = Process(target=testapp.server.main)
p.start()
time.sleep(5)
testurl=cfg.server_settings["protocol"] + cfg.server_settings["host"] + ":" +str(cfg.server_settings["port"]) + "/test/12"
r = requests.get(testurl)
p.terminate()
assert int(r.text)==12
So it would be nice to avoid the sleep() and really check when the process started ...
You should use is_alive (docs), but it will almost always return True right after you call start() on the process. If you want to make sure the process is already doing something important, there's no getting around time.sleep (at least from this end; see the last paragraph for another idea).
In any case, you could implement is_alive like this:
p = Process(target=testapp.server.main)
p.start()

while not p.is_alive():
    time.sleep(0.1)

do_something_once_alive()
As you can see we still need to "sleep" and check again (just 0.1 seconds), but it will probably be much less than 5 seconds until is_alive returns True.
If both is_alive and time.sleep aren't accurate enough for you to know if the process really does something specific yet, and if you're controlling the other program as well, you should have it raise another kind of flag so you know you're good to go.
I suggest creating your process with a connection object as argument (other synchronization primitives may work) and use the send() method within your child process to notify your parent process that business can go on. Use the recv() method on the parent end of the connection object.
import multiprocessing as mp

def worker(conn):
    conn.send(0)  # the argument object must be picklable
    # your worker is ready to do work and just signaled it to the parent

out_conn, in_conn = mp.Pipe()
process = mp.Process(target=worker, args=(out_conn,))
process.start()
in_conn.recv()  # will block until something is received
# the worker in the child process signaled it is ready; business can go on
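A multiprocessing.Event is an alternative synchronization primitive for the same handshake, and its wait(timeout) lets the parent fail fast if the child never comes up. A sketch (the 10-second timeout and the worker body are placeholders of my choosing):

```python
import multiprocessing as mp

def worker(ready):
    # ... set the server up here ...
    ready.set()  # signal the parent that setup is complete
    # ... serve requests ...

if __name__ == "__main__":
    ready = mp.Event()
    p = mp.Process(target=worker, args=(ready,))
    p.start()
    # Block until the child signals readiness, or give up after 10 seconds.
    if not ready.wait(timeout=10):
        p.terminate()
        raise RuntimeError("server process failed to start in time")
    # ... run the URL tests here, then shut the server down ...
    p.terminate()
    p.join()
```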

signal.alarm not triggering exception on time

I've slightly modified the signal example from the official docs (bottom of page).
I'm calling sleep 10 but I would like an alarm to be raised after 1 second. When I run the following snippet it takes way more than 1 second to trigger the exception (I think it runs the full 10 seconds).
import signal, os

def handler(signum, frame):
    print('Interrupted', signum)
    raise IOError("Should have fired after 1 second")

signal.signal(signal.SIGALRM, handler)
signal.alarm(1)
os.system('sleep 10')
signal.alarm(0)
How can I be sure to terminate a function after a timeout in a single-threaded application?
From the docs:
A Python signal handler does not get executed inside the low-level (C) signal handler. Instead, the low-level signal handler sets a flag which tells the virtual machine to execute the corresponding Python signal handler at a later point (for example at the next bytecode instruction).
Therefore, a signal such as the one generated by signal.alarm() can't terminate a function after a timeout in some cases. Either the function should cooperate by allowing other Python code to run (e.g., by calling PyErr_CheckSignals() periodically from C code), or you should use a separate process to terminate the function in time.
Your case can be fixed by using subprocess.check_call('sleep 10'.split()) instead of os.system('sleep 10').
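A quick way to convince yourself of the difference: with subprocess, the Python-level wait is interruptible, so the handler's exception fires after roughly one second instead of ten (a sketch; the TimeoutError choice and the 5-second bound are my own, and it assumes a POSIX system with a `sleep` binary):

```python
import signal
import subprocess
import time

def handler(signum, frame):
    # Raising here propagates out of the interrupted check_call().
    raise TimeoutError("alarm fired")

signal.signal(signal.SIGALRM, handler)

start = time.monotonic()
signal.alarm(1)
try:
    subprocess.check_call(["sleep", "10"])
except TimeoutError:
    pass
finally:
    signal.alarm(0)

elapsed = time.monotonic() - start
print(elapsed < 5)  # interrupted well before the 10 s sleep finished
```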

Anything similar to a microcontroller interrupt handler?

Is there some method where one could use a try statement to catch an error caused by a raise statement, execute code to handle the flag (e.g., update some variables), and then return to the line where the code had been operating when the flag was raised?
I am thinking specifically of an interrupt handler for a micro-controller (which does what I've just described).
I am writing some code that has a thread checking a file to see if it updates, and I want it to interrupt the main program so it is aware of the update, deals with it appropriately, and returns to the line it was running when interrupted.
Ideally, the main program would recognize the flag from the thread regardless of where it is in execution. A try statement would do this, but how could I return to the line where the flag was raised?
Thanks!
Paul
EDIT:
My attempt at an ISR after the comments; it looks like a pretty straightforward example of using locks. A small test routine at the bottom demonstrates the code:
import os
import threading
import time

def isr(path, interrupt):
    prev_mod = os.stat(path).st_mtime
    while True:
        new_mod = os.stat(path).st_mtime
        if new_mod != prev_mod:
            print("Updates! Waiting to begin")
            # Prevent entry into the critical code while updating,
            # and updating while the critical code is running.
            with interrupt:
                print("Starting updates")
                prev_mod = new_mod
                print("Finished updating")
        else:
            print("No updates")
            time.sleep(1)

def func2(interrupt):
    while True:
        with interrupt:  # prevent updates while running critical code
            # Execute critical code
            print("Running Crit Code")
            time.sleep(5)
            print("Finished Crit Code")
        # Do other things

interrupt = threading.Lock()
path = "testfil.txt"
t1 = threading.Thread(target=isr, args=(path, interrupt))
t2 = threading.Thread(target=func2, args=(interrupt,))
t1.start()
t2.start()

# Create an "update" to the file
time.sleep(12)
chngfile = open("testfil.txt", "w")
chngfile.write("changing the file")
chngfile.close()
time.sleep(10)
One standard OS way to handle interrupts is to enqueue the interrupt so another kernel thread can process it.
This partially applies in Python.
I am writing some code that has a thread checking a file to see if it updates and I want it to interrupt the main program so it is aware of the update, deals with it appropriately, and returns to the line it was running when interrupted.
You have multiple threads. You don't need to "interrupt" the main program. Simply "deal with it appropriately" in a separate thread. The main thread will find the updates when the other thread has "dealt with it appropriately".
This is why we have locks. To be sure that shared state is updated correctly.
You interrupt a thread by locking a resource the thread needs.
You make a thread interruptable by acquiring locks on resources.
In Python, we call that pattern "function calls". You cannot do this with exceptions; exceptions only unwind the stack, and always to the first enclosing except clause.
Microcontrollers have interrupts to support asynchronous events; but the same mechanism is also used in software interrupts for system calls, because an interrupt can be configured to have a different set of protection bits; the system call can be allowed to do more than the user program calling it. Python doesn't have any kind of protection levels like this, and so software interrupts are not of much use here.
As for handling asynchronous events, you can do that in python, using the signal module, but you may want to step lightly if you are also using threads.
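To make the "flag from the watcher thread, handled at a safe point" idea concrete, here is a minimal sketch using threading.Event (the watcher and handle_update bodies are stand-ins of my own for the file-checking and update logic):

```python
import threading
import time

update_pending = threading.Event()
handled = []

def watcher():
    # Stand-in for the file-watching loop: signal one update, then exit.
    time.sleep(0.1)
    update_pending.set()

def handle_update():
    # Stand-in for "deal with the update appropriately".
    handled.append(True)

def main_loop():
    for step in range(50):
        if update_pending.is_set():   # check the flag at a safe point
            update_pending.clear()
            handle_update()
        time.sleep(0.01)              # the "critical" work for this iteration

t = threading.Thread(target=watcher)
t.start()
main_loop()
t.join()
print(len(handled) >= 1)  # the update was noticed and handled
```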

Handle a blocking function call in Python

I'm working with the Gnuradio framework. I handle flowgraphs I generate to send/receive signals. These flowgraphs initialize and start, but they don't return control to my application:
import time

while time.time() < endtime:
    # invoke GRC flowgraph for 1st sequence
    if not seq1_sent:
        tb = send_seq_2.top_block()
        tb.Run(True)
        seq1_sent = True
        if time.time() < endtime:
            break
    # invoke GRC flowgraph for 2nd sequence
    if not seq2_sent:
        tb = send_seq_2.top_block()
        tb.Run(True)
        seq2_sent = True
        if time.time() < endtime:
            break
The problem is: only the first if statement invokes the flow-graph (which interacts with the hardware). I'm stuck on this. I could use a thread, but I'm inexperienced in timing out threads in Python, and I doubt it is possible, because killing threads doesn't seem to be within the APIs. This script only has to work on Linux...
How do you handle blocking functions properly in Python, without killing the whole program?
Another more concrete example for this problem is:
import signal, os

def handler(signum, frame):
    # print('Signal handler called with signal', signum)
    # raise IOError("Couldn't open device!")
    import time
    print("wait")
    time.sleep(3)

def foo():
    # Set the signal handler and a 3-second alarm
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(3)
    # This open() may hang indefinitely
    fd = os.open('/dev/ttys0', os.O_RDWR)
    signal.alarm(0)  # Disable the alarm

foo()
print("hallo")
How do I still get to print("hallo")? ;)
Thanks,
Marius
First of all, the use of signals should be avoided at all costs:
1) It may lead to a deadlock. SIGALRM may reach the process BEFORE the blocking syscall (imagine super-high load on the system!) and the syscall will not be interrupted. Deadlock.
2) Playing with signals may have some nasty non-local consequences. For example, syscalls in other threads may be interrupted, which is usually not what you want. Normally, syscalls are restarted when a (non-deadly) signal is received. When you set up a signal handler, it automatically turns off this behavior for the whole process, or thread group so to say. Check 'man siginterrupt' on that.
Believe me, I have met both problems before, and they are not fun at all.
In some cases blocking can be avoided explicitly: I strongly recommend using select() and friends (check the select module in Python) to handle blocking writes and reads. This will not solve a blocking open() call, though.
For that, I have tested this solution and it works well for named pipes. It opens the file in a non-blocking way, then turns non-blocking mode off and uses select() to eventually time out if nothing is available:
import sys, os, select, fcntl

# Open without blocking, then switch back to blocking mode
f = os.open(sys.argv[1], os.O_RDONLY | os.O_NONBLOCK)
flags = fcntl.fcntl(f, fcntl.F_GETFL, 0)
fcntl.fcntl(f, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)

# Wait up to 2 seconds for data to become readable
r, w, e = select.select([f], [], [], 2.0)
if r == [f]:
    print('ready')
    print(os.read(f, 100))
else:
    print('unready')
os.close(f)
Test this with:
mkfifo /tmp/fifo
python <code_above.py> /tmp/fifo    # 1st terminal
echo abcd > /tmp/fifo               # 2nd terminal
With some additional effort, the select() call can be used as the main loop of the whole program, aggregating all events; you can use libev or libevent, or some Python wrappers around them.
When you can't explicitly force non-blocking behavior, say when you just use an external library, it's going to be much harder. Threads may do, but obviously that is not a state-of-the-art solution; it is usually just wrong.
I'm afraid that in general you can't solve this in a robust way; it really depends on WHAT you block on.
IIUC, each top_block has a stop method, so you actually can run the top_block in a thread and issue a stop if the timeout has arrived. It would be better if the top_block's wait() also had a timeout, but alas, it doesn't.
In the main thread, you then need to wait for two cases: a) the top_block completes, and b) the timeout expires. Busy-waits are evil :-), so you should use the thread's join-with-timeout to wait for the thread. If the thread is still alive after the join, you need to stop the top_block.
You can set a signal alarm that will interrupt your call with a timeout:
http://docs.python.org/library/signal.html
signal.alarm(1) # 1 second
my_blocking_call()
signal.alarm(0)
You can also set a signal handler if you want to make sure it won't destroy your application:
def my_handler(signum, frame):
    pass

signal.signal(signal.SIGALRM, my_handler)
EDIT:
What's wrong with this piece of code? It should not abort your application:
import signal, time

def handler(signum, frame):
    print("Timed-out")

def foo():
    # Set the signal handler and a 3-second alarm
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(3)
    # This sleep() stands in for a blocking call
    time.sleep(5)
    signal.alarm(0)  # Disable the alarm

foo()
print("hallo")
The thing is:
The default handler for SIGALRM aborts the application; if you set your own handler, it should no longer stop the application.
Receiving a signal usually interrupts system calls (and then unblocks your application).
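These two observations can be packaged into a small reusable helper (a sketch for single-threaded, Unix-only code; raising TimeoutError is my convention, not part of the signal API):

```python
import signal
import time
from contextlib import contextmanager

@contextmanager
def timeout(seconds):
    def handler(signum, frame):
        raise TimeoutError(f"timed out after {seconds}s")
    # Install our handler, remembering the previous one.
    previous = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)                          # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler

# Usage: the blocking call is abandoned after 1 second.
try:
    with timeout(1):
        time.sleep(5)
except TimeoutError:
    print("interrupted")
```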
The easy part of your question relates to the signal handling. From the perspective of the Python runtime, a signal received while the interpreter is making a system call is presented to your Python code as an OSError exception with an errno attribute corresponding to errno.EINTR.
So this probably works roughly as you intended:
#!/usr/bin/env python
import signal, os, errno, time

def handler(signum, frame):
    # print('Signal handler called with signal', signum)
    # raise IOError("Couldn't open device!")
    print("timed out")
    time.sleep(3)

def foo():
    # Set the signal handler and a 3-second alarm
    signal.signal(signal.SIGALRM, handler)
    try:
        signal.alarm(3)
        # This open() may hang indefinitely
        fd = os.open('/dev/ttys0', os.O_RDWR)
    except OSError as e:
        if e.errno != errno.EINTR:
            raise
    signal.alarm(0)  # Disable the alarm

foo()
print("hallo")
Note that I've moved the import of time out of the function definition, as it seems to be poor form to hide imports in that way. It's not at all clear to me why you're sleeping in your signal handler and, in fact, it seems like a rather bad idea.
The key point I'm trying to make is that any (non-ignored) signal will interrupt your main line of Python code execution. Your handler will be invoked with arguments indicating which signal number triggered the execution (allowing for one Python function to be used for handling many different signals) and a frame object (which could be used for debugging or instrumentation of some sort).
Because the main flow through the code is interrupted, it's necessary for you to wrap that code in some exception handling in order to regain control after such events have occurred. (Incidentally, if you're writing code in C you'd have the same concern; you have to be prepared for any of your library functions with underlying system calls to return errors, and handle EINTR in the system errno by looping back to retry or branching to some alternative in your main line, such as proceeding to some other file, or without any file/input, etc.)
As others have indicated in their responses to your question, basing your approach on SIGALRM is likely to be fraught with portability and reliability issues. Worse, some of these issues may be race conditions that you'll never encounter in your testing environment and that may only occur under conditions that are extremely hard to reproduce. The ugly details tend to lie in cases of re-entrancy: what happens if signals are dispatched during execution of your signal handler?
I've used SIGALRM in some scripts and it hasn't been an issue for me under Linux. The code I was working on was suitable to the task. It might be adequate for your needs.
Your primary question is difficult to answer without knowing more about how this Gnuradio code behaves, what sorts of objects you instantiate from it, and what sorts of objects they return.
Glancing at the docs to which you've linked, I see that they don't seem to offer any sort of "timeout" argument or setting that could be used to limit blocking behavior directly. In the table under "Controlling Flow Graphs" I see that they specifically say that .run() can execute indefinitely or until SIGINT is received. I also note that .start() can start threads in your application and, it seems, returns control to your Python code line while those are running. (That seems to depend on the nature of your flow graphs, which I don't understand sufficiently).
It sounds like you could create your flow graphs, .start() them, and then (after some time processing or sleeping in your main line of Python code) call the .lock() method on your controlling object (tb?). This, I'm guessing, puts the Python representation of the state ... the Python object ... into a quiescent mode to allow you to query the state or, as they say, reconfigure your flow graph. If you call .run() it will call .wait() after it calls .start(); and .wait() will apparently run until either all blocks "indicate they are done" or until you call the object's .stop() method.
So it sounds like you want to use .start() and neither .run() nor .wait(); then call .stop() after doing any other processing (including time.sleep()).
Perhaps something as simple as:
tb = send_seq_2.top_block()
tb.start()
time.sleep(endtime - time.time())
tb.stop()
seq1_sent = True
tb = send_seq_2.top_block()
tb.start()
seq2_sent = True
.. though I'm suspicious of my time.sleep() there. Perhaps you want to do something else where you query the tb object's state (perhaps sleeping for smaller intervals, calling its .lock() method, accessing attributes that I know nothing about, and then calling its .unlock() before sleeping again).
if not seq1_sent:
    tb = send_seq_2.top_block()
    tb.Run(True)
    seq1_sent = True
    if time.time() < endtime:
        break
If 'time.time() < endtime' is true, then you will break out of the loop and the seq2_sent stuff will never be hit. Maybe you meant 'time.time() > endtime' in that test?
You could try using deferred execution... The Twisted framework uses them a lot:
http://www6.uniovi.es/python/pycon/papers/deferex/
You mention killing threads in Python. This is partially possible, although you can kill/interrupt another thread only when Python code runs, not in C code, so this may not help you the way you want.
see this answer to another question:
python: how to send packets in multi thread and then the thread kill itself
or google for killable python threads for more details like this:
http://code.activestate.com/recipes/496960-thread2-killable-threads/
If you want to set a timeout on a blocking function, threading.Thread has the method join(timeout), which blocks until the timeout expires.
Basically, something like this should do what you want:
import threading
my_thread = threading.Thread(target=send_seq_2.top_block)
my_thread.start()
my_thread.join(TIMEOUT)
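Since plain Python threads cannot be killed from outside (as noted above), the join(timeout) pattern is more decisive with multiprocessing, where the straggler can actually be terminated. A sketch (blocking_work is a stand-in of mine for the top_block call, and the 1-second timeout is arbitrary):

```python
import multiprocessing as mp
import time

def blocking_work():
    # Stand-in for the long-running flowgraph; pretend it runs for a minute.
    time.sleep(60)

if __name__ == "__main__":
    p = mp.Process(target=blocking_work)
    p.start()
    p.join(timeout=1)   # wait at most 1 second for it to finish
    if p.is_alive():    # still running: give up and kill it
        p.terminate()
        p.join()
    print(p.exitcode)   # a negative exit code means it was killed by a signal
```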
