How to run a Python program faster using GNU Parallel? - python

I have a Python program which can be executed using multiple threads; however, it fails (segmentation fault, core dumped) when using more than 1 thread.
What I was thinking of is an alternative: running the script using GNU Parallel. I am very new to this and have limited knowledge about it. Any help would be appreciated.

Say you have example.py and you want to run it N times. You can run the following (here N = 10):
seq 10 | parallel -N0 --jobs 0 example.py
Breaking this down: seq 10 causes 10 jobs to be run. The -N0 flag tells parallel to ignore the input it would normally read from the seq 10 command, so example.py is invoked with no arguments. --jobs 0 lets as many jobs run in parallel as possible.
As far as I know, parallel doesn't let you say "run this program X times with no input", so you must pipe something into the command with seq and then ignore it with the -N0 flag.
Read this for further examples of commands: https://www.gnu.org/software/parallel/man.html
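If GNU Parallel is not available, a similar fan-out can be sketched from Python itself with the standard subprocess module. Each copy runs as a full process rather than a thread, which sidesteps the multi-thread segfault; the inline -c command below is a stand-in for the question's example.py:

```python
import subprocess
import sys

# Launch 10 independent copies in separate processes; a crash in one
# cannot take the others down. The inline "-c" command is a stand-in
# for the question's example.py.
cmd = [sys.executable, "-c", "print(sum(range(1000)))"]
procs = [subprocess.Popen(cmd, stdout=subprocess.DEVNULL) for _ in range(10)]
exit_codes = [p.wait() for p in procs]
```

All 10 children run concurrently; the final list comprehension blocks until every one has exited.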

Related

Kill an MPI process in all machines

Suppose that I run an MPI program involving 25 processes on 25 different machines. The program is initiated at one of them called the "master" with a command like
mpirun -n 25 --hostfile myhostfile.txt python helloworld.py
This is executed on Linux with a bash script and it uses mpi4py. Sometimes, in the middle of execution, I want to stop the program on all machines. I don't care whether this is done gracefully or not, since the data I might need is already saved.
Usually, I press Ctrl + C in the terminal of the "master" and I think it works as described above. Is this true? In other words, will it stop this specific MPI program on all machines?
Another method I tried is to get the PID of the process on the "master" and kill it. I am not sure about this either.
Do the above methods work as described? If not, what else do you suggest? Note that I want to avoid using MPI calls for that purpose, like MPI_Abort, which some other discussions here and here suggest.
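One non-MPI approach, sketched under the assumptions that myhostfile.txt lists one hostname per line and that passwordless ssh works between the machines, is to pkill the matching Python process on every host (the default pattern helloworld.py is taken from the mpirun command above):

```python
import subprocess

def kill_everywhere(hostfile, pattern="helloworld.py", dry_run=False):
    # Read one hostname per line (extra per-host fields such as
    # "slots=1" are ignored) and pkill the matching process on each.
    with open(hostfile) as f:
        hosts = [line.split()[0] for line in f if line.strip()]
    cmds = [["ssh", host, f"pkill -f {pattern}"] for host in hosts]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=False)
    return cmds
```

This stops the ranks abruptly, which matches the stated requirement that a graceful shutdown is not needed.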

Running two python scripts with bash file

I would like to run two Python scripts at the same time on my laptop without any decrease in their calculation speed.
I have searched and saw this question saying that we should use a bash file.
I have searched but I did not understand what I should do and how to run those scripts this way, using bash.
python script1.py &
python script2.py &
I am inexperienced with this and I need your professional advice.
I do not understand how to do that, where and how.
I am using Windows 64-bit.
Best
PS: The answer I marked as accepted is a way to run two tasks in parallel, but it does not decrease the calculation time for the two parallel tasks at all.
If you can install GNU Parallel on Windows under Git Bash (ref), then you can run the two scripts on separate CPUs this way:
▶ (cat <<EOF) | parallel --jobs 2
python script1.py
python script2.py
EOF
Note from the parallel man page:
--jobs N
Number of jobslots on each machine. Run up to N jobs in parallel.
0 means as many as possible. Default is 100% which will run one job per
CPU on each machine.
Note that the question has been updated to state that parallelisation does not improve calculation time, which is not generally a correct statement.
While the benefits of parallelisation are highly machine- and workload-dependent, parallelisation significantly improves the processing time of CPU-bound processes on multi-core computers.
Here is a demonstration based on calculating 50,000 digits of Pi using Spigot's algorithm (code) on my quad-core MacBook Pro:
Single task (52s):
▶ time python3 spigot.py
...
python3 spigot.py 52.73s user 0.32s system 98% cpu 53.857 total
Running the same computation twice in GNU parallel (74s):
▶ (cat <<EOF) | time parallel --jobs 2
python3 spigot.py
python3 spigot.py
EOF
...
parallel --jobs 2 74.19s user 0.48s system 196% cpu 37.923 total
Of course this is on a system that is busy running an operating system and all my other apps, so it doesn't halve the processing time, but it is a big improvement all the same.
See also this related Stack Overflow answer.
I use a batch file which contains these lines:
start python script1.py
start python script2.py
This opens a new window for each start statement.
A quite easy way to run parallel jobs of every kind is using nohup. This redirects the output to a file called nohup.out by default. In your case you should just write:
nohup python script1.py > output_script1 &
nohup python script2.py > output_script2 &
That's it. With nohup you can also log out, and the scripts will continue running until they have finished.
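Roughly the same detachment can be had from Python itself via subprocess: on POSIX systems, start_new_session=True puts the child in its own session (similar in spirit to nohup), and stdout is redirected to a log file. The inline command below is a stand-in for the question's script1.py:

```python
import subprocess
import sys

# Detach the child into its own session (similar in spirit to nohup)
# and send its output to a log file instead of the console.
with open("output_script1", "w") as log:
    proc = subprocess.Popen(
        [sys.executable, "-c", "print('done')"],
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,
    )
proc.wait()
```

Because the child holds its own copy of the log file descriptor, closing the parent's handle does not affect the child's output.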

Disassociating process pipes from the calling shell

I am trying to use Fabric to send commands which will run many physics simulations (executables) on many different computers which all share the same storage. I would like my script to
ssh into a machine
begin the simulation, for example by running run('nohup nice -n 5 ./interp 1 2 7') (the executable is called interp and run is a function from the Fabric.api library)
detach from the shell and run another simulation on another (or the same) computer.
However I cannot get Fabric to accomplish part 3. It hangs up on the first simulation and doesn't detach until the simulation stops, which defeats the whole point.
My problem, according to the documentation
is that
Because Fabric executes a shell on the remote end for each invocation of run or sudo (see also), backgrounding a process via the shell will not work as expected. Backgrounded processes may still prevent the calling shell from exiting until they stop running, and this in turn prevents Fabric from continuing on with its own execution.
The key to fixing this is to ensure that your process’ standard pipes are all disassociated from the calling shell
The documentation provides 3 suggestions, but it is not possible for me to "use a pre-existing daemonization technique," the computers I have access to do not have screen, tmux, or dtach installed (nor can I install them), and the second proposal of including >& /dev/null < /dev/null in my command has not worked either (as far as I can tell it changed nothing).
Is there another way I can disassociate the process pipes from the calling shell?
The documentation you linked to gives an example of nohup use which you haven't followed all that closely. Merging that example with what you've tried so far gives me something that I, since I don't have Fabric installed, cannot test, but might be interesting to try:
run('nohup nice -n 5 ./interp 1 2 7 < /dev/null &> /tmp/interp127.out &')
Redirect output to /dev/null rather than my contrived output file (/tmp/interp127.out) if you don't care what the interp command emits to its stdout/stderr.
Assuming the above works, I'm unsure how you would detect that a simulation has completed, but your question doesn't seem to concern itself with that detail.
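On the completion question: one common trick (an assumption here, not something from the question) is to have the command itself drop a sentinel file when it finishes, which a later run() can test for. A small helper to build such a command string might look like:

```python
def background_cmd(cmd, sentinel):
    # Wrap a command so it runs detached, with all standard pipes
    # disassociated from the calling shell, and touches a sentinel
    # file on completion that a later check can test for.
    return (
        f"nohup sh -c '{cmd}; touch {sentinel}' "
        "< /dev/null > /dev/null 2>&1 &"
    )
```

For example, run(background_cmd('nice -n 5 ./interp 1 2 7', '/tmp/interp127.done')), followed later by a test -f /tmp/interp127.done check on the same host.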

Run bash scripts in parallel from python script

I'm facing a problem in Python:
My script, at a certain point, has to run some test scripts written in bash, and I have to do it in parallel and wait until they end.
I've already tried:
os.system("./script.sh &")
inside a for loop, but it did not work.
Any suggestions?
Thank you!
edit
I have not correctly explained my situation:
My Python script resides in the home dir;
my sh scripts reside in other dirs, for instance /tests/folder1 and /tests/folder2;
Trying to use os.system implies calling os.chdir prior to os.system (to avoid "no such file or directory" troubles, since my .sh scripts contain some relative references), and this method also blocks my terminal output.
Trying to use Popen and passing the full path from the home folder to my .sh scripts leads to zombie processes without any response.
Hope to find a solution,
Thank you guys!
Have you looked at subprocess? The convenience functions call and check_output block, but the default Popen object doesn't:
import subprocess

processes = []
processes.append(subprocess.Popen(['script.sh']))
processes.append(subprocess.Popen(['script2.sh']))
...
return_codes = [p.wait() for p in processes]
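Regarding the edit about directories: Popen accepts a cwd argument, so each child can run in its own working directory (keeping the scripts' relative references valid) without any os.chdir in the parent. A sketch, with temporary directories standing in for the question's /tests/folder1 and /tests/folder2:

```python
import subprocess
import sys
import tempfile

# Each child runs with its own working directory via cwd=, so relative
# paths inside the scripts keep working and the parent never chdirs.
folders = [tempfile.mkdtemp(), tempfile.mkdtemp()]
procs = [
    subprocess.Popen(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=folder,
        stdout=subprocess.PIPE,
        text=True,
    )
    for folder in folders
]
outputs = [p.communicate()[0].strip() for p in procs]
```

In the question's case, each Popen would take the script path and its folder, e.g. cwd="/tests/folder1".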
Can you use GNU Parallel?
ls test_scripts*.sh | parallel
Or:
parallel ::: script1.sh script2.sh ... script100.sh
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new job whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Multi-threading using shell script

I am using a python script to perform some calculations on my images and save the resulting array into a .png file. I deal with 3000 to 4000 images. To perform all of this I use a shell script on Ubuntu. It gets the job done, but is there any way to make it faster? I have 4 cores in my machine. How can I use all of them? The script I am using is below
#!/bin/bash
cd $1
for i in $(ls *.png)
do
python ../tempcalc12.py $i
done
cd ..
tempcalc12.py is my python script
This question might be trivial. But I am really new to programming.
Thank you
xargs has a --max-procs= (or -P) option which runs the jobs in parallel.
The following does the job with a maximum of 4 processes at a time:
ls *.png | xargs -n 1 -P 4 python ../tempcalc12.py
You can just add a & to the python line to have everything executed in parallel:
python ../tempcalc12.py $i &
This is a bad idea though, as having too many processes will just slow everything down.
What you can do is limit the number of concurrent jobs, like this:
MAX_THREADS=4
for i in $(ls *.png); do
python ../tempcalc12.py $i &
while [ $( jobs | wc -l ) -ge "$MAX_THREADS" ]; do
sleep 0.1
done
done
Every 100 ms, it checks the number of running jobs; as soon as that drops below MAX_THREADS, the loop launches the next job in the background.
This is a nice hack if you just want a quick working solution, but you might also want to investigate what GNU Parallel can do.
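The same throttling can also be done from the Python side, so no shell wrapper is needed at all. A sketch with a thread pool of 4 workers, each running one tempcalc12.py subprocess at a time (the script name and its relative location are taken from the question):

```python
import glob
import subprocess
import sys
from multiprocessing.pool import ThreadPool

def process(png):
    # Each worker thread runs one subprocess at a time; the actual work
    # happens in a separate python process, so all 4 cores get used.
    return subprocess.call([sys.executable, "../tempcalc12.py", png])

pngs = sorted(glob.glob("*.png"))
with ThreadPool(processes=4) as pool:
    exit_codes = pool.map(process, pngs)
```

pool.map keeps exactly 4 jobs in flight and starts the next one as soon as a worker frees up, which is the same behavior the sleep loop above approximates.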
If you have GNU Parallel you can do:
parallel python ../tempcalc12.py ::: *.png
It will do The Right Thing by spawning one job per core, even if the names of your PNGs contain spaces, ', or ". It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed you will not get half a line from two different jobs.
