I was granted access to a high-performance computing system to conduct some machine learning experiments.
This system has IBM LSF 10.1 installed.
I was instructed to run the bsub command to submit a new ML task to a queue.
I use Python + Keras + TensorFlow for my tasks.
My typical workflow is as follows: I define the NN architecture and training parameters in a Python script, train.py, commit it to a git repo, then run it.
Then I make some changes to train.py, commit them, and run it again.
I've developed the following bsub script:
#!/bin/bash
#
#BSUB -P "project"
#BSUB -q queue
#BSUB -n 1
#BSUB -o %J.log
#BSUB -e %J.err
#BSUB -cwd "/home/user/my_project/nntrain"
module load cuda9.0 cudnn_v7 nccl_2.1.15
source /home/user/my_python/bin/activate
export PYTHONPATH=/home/user/my_project/lib
python train.py 2>&1 | tee ${LSB_JOBID}_out.log
Now the question.
I've defined a network, then run bsub < batch_submit.
The job is put in a queue and assigned some identifier, say 12345678.
While it is still pending, waiting for a free node, I make some changes to train.py to create a new variant and submit it again in the same manner: bsub < batch_submit
Let the new job ID be 12345692. Job 12345678 is still waiting.
Now I've got two jobs, each waiting for a node.
What about the script train.py?
Will it be the same for both of them?
Yes, it will. When you submit the job, bsub only looks at the initial lines starting with #BSUB in order to determine what resources your job requires and on which node(s) it is best run.
All the other parts of the script, the lines that do not start with #BSUB, are interpreted only when the job stops pending and starts running. At that point bash encounters the command python train.py, loads the current version of train.py, and executes it.
That is, bsub does not "freeze" the environment in any way; when the job starts running, it runs the latest version of train.py. If you submit two jobs that both refer to the same .py file, they will both run the same Python script (the latest version).
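If you do want a submission pinned to the code as it was at submit time, one simple pattern (my own suggestion, not something LSF does for you) is to snapshot train.py before submitting and point that job at the snapshot:
# sketch only: run from the nntrain directory so the snapshot lands next to train.py
snapshot="train_$(git rev-parse --short HEAD).py"   # or use a timestamp instead of the commit hash
cp train.py "$snapshot"
sed "s/train\.py/${snapshot}/" batch_submit | bsub  # submit a copy of the template that runs the snapshot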
In case you're wondering how to run a thousand jobs with a thousand different settings, here is what I usually do:
Make sure that your .py script can either accept command-line arguments with configuration parameters, or read its configuration from a file; do not rely on manually modifying the script to change settings.
Create a bsub template file that looks approximately like your bash script above, but leaves at least one meta-variable which specifies the parameters of the experiment. By "meta-variable" I mean a unique string that doesn't collide with anything else in your bash script, for example NAME_OF_THE_DATASET:
#!/bin/bash
#
#BSUB -P "project"
#BSUB -q queue
#BSUB -n 1
#BSUB -o %J.log
#BSUB -e %J.err
#BSUB -cwd "/home/user/project/nntrain"
module load cuda9.0 cudnn_v7 nccl_2.1.15
source /home/user/my_python/bin/activate
export PYTHONPATH=/home/user/my_project/lib
python train.py NAME_OF_THE_DATASET 2>&1 | tee ${LSB_JOBID}_out.log
Create a separate bash script with a loop that plugs in different values for the meta-variable (e.g. by replacing NAME_OF_THE_DATASET with myDataset1.csv, ..., myDatasetN.csv using sed), and then submits the modified template with bsub, as sketched below.
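For example, a minimal driver sketch could look like this (the template file name batch_template and the dataset names are placeholders):
#!/bin/bash
# substitute the meta-variable and submit one job per dataset
for dataset in myDataset1.csv myDataset2.csv myDataset3.csv; do
    sed "s/NAME_OF_THE_DATASET/${dataset}/g" batch_template > "batch_submit_${dataset}"
    bsub < "batch_submit_${dataset}"
done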
It might not be the simplest solution (one can probably get away with simpler numbering schemes using the facilities of bsub itself), but I found it very flexible, because it works equally well with multiple meta-variables and all kinds of flags and settings, and it also lets you insert different preprocessing scripts into the bsub template.
Related
To partition the hard disks I used fdisk. fdisk asks for command-line input like this:
Command (m for help): p
I need to run this on 16 servers, so I am using a Fabric script to run it on all of them. But every time it asks for the interactive commands.
Is there any option in Fabric to supply these commands automatically?
So, this is just a matter of figuring out the partitioning commands you need and then creating a bash script out of them. There are several ways to script this; pick one of them and then fabricify it, e.g.:
sudo('apt-get update')
sudo('apt-get install parted')
sudo('parted -a optimal /dev/usb mkpart primary 0% 4096MB')
Replace /dev/usb with your disk. You'll also have to mount the new partition and add an entry to /etc/fstab.
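For completeness, the remaining steps might look roughly like the following additional Fabric calls; the partition name /dev/usb1, the ext4 filesystem, and the mount point /mnt/data are placeholders rather than part of the original answer:
sudo('mkfs.ext4 /dev/usb1')           # put a filesystem on the new partition
sudo('mkdir -p /mnt/data')
sudo('mount /dev/usb1 /mnt/data')
sudo("echo '/dev/usb1 /mnt/data ext4 defaults 0 2' >> /etc/fstab")  # make the mount persistent across reboots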
I have a python submission script that I run with sbatch using slurm:
sbatch batch.py
When I do this, things do not work properly because, I assume, the batch.py process does not inherit the right environment variables. Thus, instead of running batch.py from where the sbatch command was issued, it is run from somewhere else (/, I believe). I have managed to fix this by wrapping the Python script in a bash script:
#!/usr/bin/env bash
cd path/to/scripts
python script.py
This temporary hack sort of works, though it seems to avoid the question altogether rather than address it. Does someone know a better way to fix this?
I know, for example, that Docker has the -w flag (and the WORKDIR directive) so that the container knows what its working directory is supposed to be. I was wondering if something like that exists for Slurm.
Slurm is designed to push the user's environment at submit time to the job, except for variables explicitly disabled by the user or the system administrator.
But the script itself is run as follows: it is copied to the master node of the allocation, into a Slurm-specific directory, and run from there, with $PWD set to the directory where the sbatch command was run.
You can see that with a simple script like this one:
$ cat t.sh
#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=res_ms.txt
echo $PWD
dirname $(readlink -f "$0")
$ sbatch t.sh
Submitted batch job 1109631
$ cat res_ms.txt
/home/damienfrancois/
/var/spool/slurm/job1109631
One consequence is that Python scripts that import modules from the current directory fail to do so: sys.path contains the spool directory where the copied script lives, but not the submission directory. The workaround is to explicitly add sys.path.append(os.getcwd()) before the failing imports.
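A minimal sketch of that workaround (the module name is just a placeholder):
import os
import sys
# sys.path contains the Slurm spool directory holding the copied script,
# but not the submission directory; $PWD is the submission directory, so add it.
sys.path.append(os.getcwd())
import my_local_module  # hypothetical module living in the submission directory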
I am trying to use Fabric to send commands which will run many physics simulations (executables) on many different computers which all share the same storage. I would like my script to
1. ssh into a machine;
2. begin the simulation, for example by running run('nohup nice -n 5 ./interp 1 2 7') (the executable is called interp and run is a function from the fabric.api library);
3. detach from the shell and run another simulation on another (or the same) computer.
However I cannot get Fabric to accomplish part 3. It hangs up on the first simulation and doesn't detach until the simulation stops, which defeats the whole point.
My problem, according to the documentation, is that:
Because Fabric executes a shell on the remote end for each invocation of run or sudo (see also), backgrounding a process via the shell will not work as expected. Backgrounded processes may still prevent the calling shell from exiting until they stop running, and this in turn prevents Fabric from continuing on with its own execution.
The key to fixing this is to ensure that your process’ standard pipes are all disassociated from the calling shell
The documentation provides three suggestions, but it is not possible for me to "use a pre-existing daemonization technique": the computers I have access to do not have screen, tmux, or dtach installed (nor can I install them), and the second proposal, of including >& /dev/null < /dev/null in my command, has not worked either (as far as I can tell it changed nothing).
Is there another way I can disassociate the process pipes from the calling shell?
The documentation you linked to gives an example of nohup usage which you haven't followed all that closely. Merging that example with what you've tried so far gives something that I cannot test (I don't have Fabric installed), but which might be worth trying:
run('nohup nice -n 5 ./interp 1 2 7 < /dev/null &> /tmp/interp127.out &')
Redirect output to /dev/null rather than my contrived output file (/tmp/interp127.out) if you don't care what the interp command emits to its stdout/stderr.
Assuming the above works, I'm unsure how you would detect that a simulation has completed, but your question doesn't seem to concern itself with that detail.
I'm facing a problem in Python:
at a certain point my script has to run some test scripts written in bash, and I have to run them in parallel and wait until they all end.
I've already tried:
os.system("./script.sh &")
inside a for loop, but it did not work.
Any suggestions?
Thank you!
Edit
I have not explained my situation correctly:
My Python script resides in the home dir;
my sh scripts reside in other dirs, for instance /tests/folder1 and /tests/folder2;
trying to use os.system requires calling os.chdir first (to avoid "no such file or directory" errors, since my .sh scripts contain some relative references), and this method also blocks my terminal output;
trying to use Popen and passing the full path from the home folder to my .sh scripts launches zombie processes that give no response at all.
Hope to find a solution,
Thank you guys!
Have you looked at subprocess? The convenience functions call and check_output block, but the default Popen object doesn't:
import subprocess

processes = []
# Popen returns immediately; cwd= makes each script run from its own directory
processes.append(subprocess.Popen(['./script.sh'], cwd='/tests/folder1'))
processes.append(subprocess.Popen(['./script2.sh'], cwd='/tests/folder2'))
...
# wait() blocks until the corresponding process exits and returns its exit code
return_codes = [p.wait() for p in processes]
Can you use GNU Parallel?
ls test_scripts*.sh | parallel
Or:
parallel ::: script1.sh script2.sh ... script100.sh
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU; GNU Parallel instead spawns a new job as soon as one finishes, keeping all CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
I am using a Python script to perform some calculations on each of my images and save the resulting array to a .png file. I deal with 3000 to 4000 images. To process them all I use a shell script on Ubuntu, and it gets the job done. But is there any way to make it faster? I have 4 cores in my machine; how can I use all of them? The script I am using is below:
#!/bin/bash
cd $1
for i in $(ls *.png)
do
python ../tempcalc12.py $i
done
cd ..
tempcalc12.py is my python script
This question might be trivial. But I am really new to programming.
Thank you
xargs has a --max-procs= (or -P) option which runs jobs in parallel.
The following runs the jobs with at most 4 processes at a time:
ls *.png | xargs -n 1 -P 4 python ../tempcalc12.py
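If the file names may contain spaces, a more robust variant (my addition, assuming GNU find and xargs) feeds NUL-delimited names instead of parsing ls:
find . -maxdepth 1 -name '*.png' -print0 | xargs -0 -n 1 -P 4 python ../tempcalc12.py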
You can just add a & to the python line to have everything executed in parallel:
python ../tempcalc12.py $i &
This is a bad idea though, as having too many processes will just slow everything down.
What you can do is limit the number of concurrent jobs, like this:
MAX_THREADS=4
for i in *.png; do                  # iterate over the files directly instead of parsing ls
    python ../tempcalc12.py "$i" &
    # pause while the number of running background jobs is at the limit
    while [ "$(jobs -r | wc -l)" -ge "$MAX_THREADS" ]; do
        sleep 0.1
    done
done
wait                                # let the last batch of jobs finish
Every 100 ms the loop checks the number of running jobs and, as soon as the count drops below MAX_THREADS, it starts the next job in the background.
This is a nice hack if you just want a quick working solution, but you might also want to investigate what GNU Parallel can do.
If you have GNU Parallel you can do:
parallel python ../tempcalc12.py ::: *.png
It will do The Right Thing by spawning one job per CPU core, even if the names of your PNGs contain spaces, ', or " in them. It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed not to get half a line from two different jobs.
GNU Parallel is a general parallelizer; see the installation instructions and learning resources in the GNU Parallel answer above.