Run multiple files consecutively via SLURM with individual timeout - python

I have a Python script that I run on HPC. A wrapper script takes a list of files from a text file and starts multiple sbatch runs:
./launch_job.sh 0_folder_file_list.txt
launch_job.sh goes through 0_folder_file_list.txt and submits an sbatch job for each file:
SAMPLE_LIST=`cut -d "." -f 1 $1`
for SAMPLE in $SAMPLE_LIST
do
    echo "Getting accessions from $SAMPLE"
    sbatch get_acc.slurm $SAMPLE
    #./get_job.slurm $SAMPLE
done
get_job.slurm has all of my SBATCH information, module loads, etc., and runs:
srun --mpi=pmi2 -n 5 python python_script.py ${SAMPLE}.txt
I don't want to start all of the jobs at once; I would like them to run consecutively, each with a 24-hour maximum run time. I have already set my SBATCH -t to allow for a maximum time, but I only want each individual job to run for a maximum of 24 hours. Is there an srun argument I can set that will accomplish this? Something else?

You can use the --wait flag with sbatch.
-W, --wait
Do not exit until the submitted job terminates. The exit code of the sbatch command will be the same as the exit code of the submitted job. If the job terminated due to a signal rather than a normal exit, the exit code will be set to 1. In the case of a job array, the exit code recorded will be the highest value for any task in the job array.
In your case:
for SAMPLE in $SAMPLE_LIST
do
    echo "Getting accessions from $SAMPLE"
    sbatch --wait get_acc.slurm $SAMPLE
done
This way, the next sbatch command is only called after the previous job finishes, either because it ended or because its time limit was reached. Since you already set SBATCH -t, each job will be stopped at your 24-hour limit.

Related

How to extract with Python the list of ids of jobs running on an LSF cluster?

I am currently writing a Python script to launch many simulations in parallel by calling this command repeatedly:
os.system("bsub -q reg -app ... file.cir")
I need to retrieve the list of job IDs in order to know exactly when all the jobs are completed, so I can then process the data. My idea is simply to loop over the job ID list and check whether each of them has completed.
I have tried using getpid(), but I believe it only gives me the ID of the running Python process itself.
I know bjobs gives you the list of running jobs, but I do not see how to parse its output from my Python script.
How can I do that?
Alternatively, would there be an easier way to find out when all the jobs I run on the LSF cluster have finished?
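A rough sketch of the polling approach, assuming subprocess is available: bsub's stdout typically contains a line like Job <12345> is submitted to queue <reg>, so the job id can be parsed from there. The exact banner format, the bjobs -noheader flag, and the PEND/RUN check are assumptions about your LSF setup, and the "..." in the bsub command is left as in the question.
import re
import subprocess
import time

def submit(cmd):
    # Run bsub and capture its stdout (os.system discards it).
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    # Parse the job id from the bsub banner, e.g. "Job <12345> is submitted ...".
    return re.search(r"Job <(\d+)>", result.stdout).group(1)

# Submit the simulations and remember their job ids.
job_ids = [submit("bsub -q reg -app ... file.cir") for _ in range(3)]

# Poll bjobs until none of our jobs are pending or running any more.
while True:
    result = subprocess.run("bjobs -noheader " + " ".join(job_ids),
                            shell=True, capture_output=True, text=True)
    if "PEND" not in result.stdout and "RUN" not in result.stdout:
        break
    time.sleep(60)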

How to immediately submit all Snakemake jobs to slurm cluster

I'm using snakemake to build a variant calling pipeline that can be run on a SLURM cluster. The cluster has login nodes and compute nodes. Any real computing should be done on the compute nodes in the form of an srun or sbatch job. Jobs are limited to 48 hours of runtime. My problem is that processing many samples, especially when the queue is busy, will take more than 48 hours to process all the rules for every sample. The traditional cluster execution for snakemake leaves a master thread running that only submits rules to the queue after all the rule's dependencies have finished running. I'm supposed to run this master program on a compute node, so this limits the runtime of my entire pipeline to 48 hours.
I know SLURM jobs have dependency directives that tell a job to wait to run until other jobs have finished. Because the snakemake workflow is a DAG, is it possible to submit all the jobs at once, with each job having its dependencies defined by the rule's dependencies from the DAG? After all the jobs are submitted the master thread would complete, circumventing the 48 hour limit. Is this possible with snakemake, and if so, how does it work? I've found the --immediate-submit command line option, but I'm not sure whether it has the behavior I'm looking for, or how to use it, because my cluster prints Submitted batch job [id] after a job is submitted to the queue, instead of just the job id.
Immediate submit unfortunately does not work out of the box; it needs some tuning, because the way dependencies between jobs are passed along differs between cluster systems. A while ago I struggled with the same problem. As the --immediate-submit docs say:
Immediately submit all jobs to the cluster instead of waiting for present input files. This will fail, unless you make the cluster aware of job dependencies, e.g. via: $ snakemake --cluster 'sbatch --dependency {dependencies}'. Assuming that your submit script (here sbatch) outputs the generated job id to the first stdout line, {dependencies} will be filled with space separated job ids this job depends on.
So the problem is that sbatch does not output just the generated job id on the first stdout line (your cluster prints Submitted batch job [id]). However, we can work around this with our own shell script:
parseJobID.sh:
#!/bin/bash
# Helper script that parses the slurm output for the job ID
# and feeds it back to snakemake/slurm for dependencies.
# This is required when you want to use the snakemake --immediate-submit option.
if [[ "Submitted batch job" =~ "$@" ]]; then
    # No dependencies were passed in, so emit nothing.
    echo -n ""
else
    # Extract the numeric job IDs and join them with commas.
    deplist=$(grep -Eo '[0-9]{1,10}' <<< "$@" | tr '\n' ',' | sed 's/.$//')
    echo -n "--dependency=aftercorr:$deplist"
fi
And make sure to give the script execute permission with chmod +x parseJobID.sh.
We can then call immediate submit like this:
snakemake --cluster 'sbatch $(./parseJobID.sh {dependencies})' --jobs 100 --notemp --immediate-submit
Note that this will submit at most 100 jobs at the same time. You can increase or decrease this to any number you like, but know that most cluster systems do not allow more than 1000 jobs per user at the same time.

Is there any way to run a secondary python script at regular intervals to work on output of a primary script in Slurm?

I am submitting a batch script that involves a primary command/script (an MPI process) that outputs data, and I need to evaluate the progress of the primary process by running a secondary Python script at fixed intervals of time while the primary process is still running. Is there any command that would allow me to do this with a Slurm batch script?
As an example, suppose the primary process takes 24 hours. If I simply place the Python script after the primary command/script, it would only run once the primary process ends. I need the Python command/script to run every hour to process the data generated by the primary process. Is this possible with Slurm?
The structure of the script would look like this:
#!/bin/bash
#SBATCH ...
#SBATCH ...

# Run the secondary script every hour, in the background.
while : ; do sleep 3600 ; python <secondary script> ; done &

# Run the primary process in the foreground.
mpirun <primary command>
The idea is to run the secondary script in an infinite loop in the background. When the primary command finishes, the job is terminated and the background loop is stopped.

python capture and store curl command process id

I have a Python script which uses threading to run a process.
It starts three threads executing curl commands on a different server.
I am not using threading sub-processes or any Python sub-processes (my version of Python does not support them), and I want to kill these specific curl processes running on the other server.
I want to capture the process ID at the time each curl command is executed, put it in a list, and then use this list of PIDs to kill the processes if necessary.
What is the best way to do this?
I have tried a few ways but nothing is working.
curl command & echo $!
will print the PID to the screen, but I want to capture it. I have tried setting it as a variable, exporting it...
If I try:
ret,output = remoteConnection.runCmdGetOutput(cmd)
it returns the output (which includes the PID) of one curl command (the first one) but not all 3 (I think this has to do with the threading), and the parent script continues.
Any ideas?
Thanks
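One possible sketch, building on the remoteConnection.runCmdGetOutput helper from the question (its exact behaviour, returning the command's stdout as a string, is an assumption here, and curl_commands is a placeholder for the three curl command strings): append & echo $! to each command so the remote shell backgrounds the curl and prints its PID, then collect the PIDs in a shared list.
import threading

pids = []                      # PIDs of the curl processes on the remote server
pids_lock = threading.Lock()

def run_curl(remote, curl_cmd):
    # Background the curl on the remote server and echo its PID; with the output
    # redirected away, the PID should be the last line the command prints.
    ret, output = remote.runCmdGetOutput(curl_cmd + " > /dev/null 2>&1 & echo $!")
    pid = output.strip().splitlines()[-1]
    pids_lock.acquire()
    try:
        pids.append(pid)
    finally:
        pids_lock.release()

threads = [threading.Thread(target=run_curl, args=(remoteConnection, cmd))
           for cmd in curl_commands]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Later, if the transfers need to be aborted on the remote server:
# remoteConnection.runCmdGetOutput("kill " + " ".join(pids))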

How can I start and stop a Python script from shell

Thanks for helping!
I want to start and stop a Python script from a shell script. The start works fine, but I want to stop/terminate the Python script after 10 seconds (it's a counter that keeps counting), and it won't stop... I think the shell script is hanging on the first line.
What is the right way to start it, wait 10 seconds, and then stop it?
Shell script:
python /home/pi/count1.py
sleep 10
kill /home/pi/count1.py
It's not working yet. I get the point of running the script in the background; that's working! But I get another message from my Raspberry Pi after doing:
python /home/pi/count1.py &
sleep 10; kill /home/pi/count1.py
/home/pi/sebastiaan.sh: line 19: kill: /home/pi/count1.py: arguments must be process or job IDs
It must be something in this line (but what? Thanks for helping out!):
sleep 10; kill /home/pi/count1.py
You're right, the shell script "hangs" on the first line until the Python script finishes; if it never finishes, the shell script won't continue. Therefore you have to put & at the end of that command to run it in the background. This way, the Python script starts and the shell script continues.
The kill command doesn't take a path, it takes a process id. After all, you might run the same program several times, and then try to kill the first, or last one.
The bash shell supports the $! variable, which is the pid of the last background process.
Your current example script is wrong, because it doesn't run the python job and the sleep job in parallel. Without adornment, the script will wait for the python job to finish, then sleep 10 seconds, then kill.
What you probably want is something like:
python myscript.py & # <-- Note '&' to run in background
LASTPID=$! # Save $! in case you do other background-y stuff
sleep 10; kill $LASTPID # Sleep then kill to set timeout.
You can terminate any process from another one, provided the OS lets you do it, i.e. if it isn't some critical process belonging to the OS itself.
The kill command takes a PID, not the process's name or command line.
Use pkill if you want to kill by name.
You can also send it a different signal instead of SIGTERM (the request to terminate a program), one that you can detect inside your Python application and respond to.
For instance, you may wish to check whether the process is alive and get some data from it.
To do this, choose one of the user-defined signals (SIGUSR1 or SIGUSR2) and register a handler for it within your Python program using the signal module, as in the sketch below.
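As an illustration, here is a minimal sketch (not the asker's actual count1.py) of a counting script that registers a handler for SIGUSR1; the shell can then run kill -USR1 $LASTPID to ask for its progress without terminating it:
import signal
import time

count = 0

def report(signum, frame):
    # Runs whenever the process receives SIGUSR1: report progress and keep counting.
    print("current count:", count)

signal.signal(signal.SIGUSR1, report)

while True:
    count += 1
    time.sleep(1)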
To see why your script hangs, see Austin's answer.
