I am using a Python script to perform some calculations on an image and save the resulting array to a .png file. I deal with 3000 to 4000 images. To process all of them I use a shell script on Ubuntu, and it gets the job done. But is there any way to make it faster? I have 4 cores in my machine; how can I use all of them? The script I am using is below:
#!/bin/bash
cd $1
for i in $(ls *.png)
do
python ../tempcalc12.py $i
done
cd ..
tempcalc12.py is my Python script.
This question might be trivial, but I am really new to programming.
Thank you
xargs has a --max-procs= (or -P) option which runs the jobs in parallel.
The following command runs at most 4 processes at a time:
ls *.png | xargs -n 1 -P 4 python ../tempcalc12.py
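If some of the file names might contain spaces, a null-delimited variant of the same idea is safer; this sketch uses find's -print0 together with xargs -0:
find . -maxdepth 1 -name '*.png' -print0 | xargs -0 -n 1 -P 4 python ../tempcalc12.py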
You can just add an & to the python line to have everything executed in parallel:
python ../tempcalc12.py $i &
This is a bad idea though, as having too many processes will just slow everything down.
What you can do is limit the number of concurrent jobs, like this:
MAX_THREADS=4
for i in $(ls *.png); do
    python ../tempcalc12.py "$i" &
    while [ "$(jobs | wc -l)" -ge "$MAX_THREADS" ]; do
        sleep 0.1
    done
done
wait  # block until the last background jobs finish
Every 100 ms it checks the number of running jobs and starts the next one in the background as soon as the count drops below MAX_THREADS; the final wait blocks until the last jobs have finished.
This is a nice hack if you just want a quick working solution, but you might also want to investigate what GNU Parallel can do.
If you have GNU Parallel you can do:
parallel python ../tempcalc12.py ::: *.png
It will do The Right Thing by spawning one job per core, even if the names of your PNGs contain spaces, ', or " in them. It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed not to get half a line from two different jobs.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new job as soon as one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
I have a project where I have to regularly use a shell script that does some pre-processing of files. It has to be done this way per project requirements and legacy reasons - I've inherited a large portion of this code.
Once those files are processed, the output files are FURTHER processed by a Python script.
Is there any good way to run this in parallel? Right now, this is how my workflow looks.
Call shell script, processing thousands of files.
Once finished, call Python script, processing even more files.
Once finished, call SQL script to insert all of these files into a database.
If it's possible to parallelize either per file as a group (one file: shell --> Python --> SQL) or to parallelize each stage (parallel shell, parallel Python, parallel SQL), that'd be great. Everything I've read, though, seems to imply this is a logistical nightmare because of read/write conflicts. Is this true, and if not, any pointers in the right direction?
For a shell, you can use xargs to run multiple processes in parallel.
Example:
echo dir1 dir2 dir3 | xargs -P 3 -I NAME tar czf NAME.tar.gz NAME
The key is -P, which tells xargs to run 3 parallel processes.
For Python, you can use ThreadPoolExecutor from concurrent.futures.
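For illustration, here is a minimal sketch of that approach; process_file and the output/*.dat glob are placeholders for whatever your per-file Python step actually does:
from concurrent.futures import ThreadPoolExecutor
import glob

def process_file(path):
    # placeholder for the real per-file processing
    print("processing", path)
    return path

files = glob.glob("output/*.dat")  # hypothetical location of the shell step's output
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_file, files))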
For SQL I can't say anything without knowing which database you are using.
I would like to run two Python scripts at the same time on my laptop without any decrease in their calculation speed.
I have searched and saw this question saying that we should use a bash file like the following, but I did not understand what I should do or how to run those scripts this way with bash:
python script1.py &
python script2.py &
I am inexperienced with this and need your advice; I do not understand how or where to do this.
I am using Windows 64bit.
Best
PS: The answer I accepted is a way to run two tasks in parallel, but it does not decrease the calculation time for the two parallel tasks at all.
If you can install GNU Parallel on Windows under Git Bash (ref), then you can run the two scripts on separate CPUs this way:
▶ (cat <<EOF) | parallel --jobs 2
python script1.py
python script2.py
EOF
Note from the parallel man page:
--jobs N
Number of jobslots on each machine. Run up to N jobs in parallel.
0 means as many as possible. Default is 100% which will run one job per
CPU on each machine.
Note that the question has been updated to state that parallelisation does not improve calculation time, which is not generally a correct statement.
While the benefits of parallelisation are highly machine- and workload-dependent, parallelisation significantly improves the processing time of CPU-bound processes on multi-core computers.
Here is a demonstration based on calculating 50,000 digits of Pi using Spigot's algorithm (code) on my quad-core MacBook Pro:
Single task (52s):
▶ time python3 spigot.py
...
python3 spigot.py 52.73s user 0.32s system 98% cpu 53.857 total
Running the same computation twice in GNU parallel (74s):
▶ (cat <<EOF) | time parallel --jobs 2
python3 spigot.py
python3 spigot.py
EOF
...
parallel --jobs 2 74.19s user 0.48s system 196% cpu 37.923 total
Of course this is on a system that is busy running an operating system and all my other apps, so it doesn't halve the processing time, but it is a big improvement all the same.
See also this related Stack Overflow answer.
I use a batch file which contains these lines:
start python script1.py
start python script2.py
This opens a new window for each start statement.
A quite easy way to run parallel jobs of any kind is to use nohup. It redirects the output to a file called nohup.out by default. In your case you should just write:
nohup python script1.py > output_script1 &
nohup python script2.py > output_script2 &
That's it. With nohup you can also log out, and the scripts will keep running until they have finished.
I was granted an access to some high-performance computing system to conduct some experiments with machine learning.
This system has IBM LSF 10.1 installed.
I was instructed to run bsub command to submit a new ML task to a queue.
I use Python+Keras+Tensorflow for my tasks.
My typical workflow is the following: I define the NN architecture and training parameters in a Python script, train.py, commit it to the git repo, then run it.
Then I make some changes in train.py, commit it and run again.
I've developed the following bsub script:
#!/bin/bash
#
#BSUB -P "project"
#BSUB -q queue
#BSUB -n 1
#BSUB -o %J.log
#BSUB -e %J.err
#BSUB -cwd "/home/user/my_project/nntrain"
module load cuda9.0 cudnn_v7 nccl_2.1.15
source /home/user/my_python/bin/activate
export PYTHONPATH=/home/user/my_project/lib
python train.py 2>&1 | tee ${LSB_JOBID}_out.log
Now the question.
I've defined a network and then run bsub < batch_submit.
The job is put in a queue and is assigned some identifier, say 12345678.
While it is not yet running, waiting for a free node, I make some changes to train.py to create a new variant and submit it again in the same manner: bsub < batch_submit
Let the new job ID be 12345692. The job 12345678 is still waiting.
Now I've got two jobs, waiting for their nodes.
What about the script train.py?
Will it be the same for both of them?
Yes, it will. When you submit the job, bsub will look only at the first few lines starting with #BSUB in order to determine what resources are required by your job, and on which node(s) to run it best.
All the other parts of the script, which do not start with #BSUB, are interpreted only when the script stops pending and starts running. At one particular line, bash will encounter the command python train.py, load the current version of train.py, and execute it.
That is, bsub does not "freeze" the environment in any way; when the job starts running, it will run the latest version of train.py. If you submit two jobs that both refer to the same .py-file, they both will run the same python script (the latest version).
In case you're wondering how to run a thousand jobs with a thousand different settings, here is what I usually do:
Make sure that your .py script can either accept command-line arguments with configuration parameters or read its configuration from a file; do not rely on manually modifying the script to change settings.
Create a bsub-template file that looks approximately like your bash script above, but leaves at least one meta-variable which can specify the parameters of the experiment. By "meta-variable" I mean a unique string that doesn't collide with anything else in your bash script, for example NAME_OF_THE_DATASET:
#!/bin/bash
#
#BSUB -P "project"
#BSUB -q queue
#BSUB -n 1
#BSUB -o %J.log
#BSUB -e %J.err
#BSUB -cwd "/home/user/project/nntrain"
module load cuda9.0 cudnn_v7 nccl_2.1.15
source /home/user/my_python/bin/activate
export PYTHONPATH=/home/user/my_project/lib
python train.py NAME_OF_THE_DATASET 2>&1 | tee ${LSB_JOBID}_out.log
Create a separate bash-script with a loop that plugs in different values for the metavariable (e.g. by replacing NAME_OF_THE_DATASET by myDataset1.csv, ... , myDatasetN.csv using sed), and then submits the modified template by bsub.
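A minimal sketch of such a submission loop; the template file name batch_template and the dataset names are assumptions, not from your setup:
#!/bin/bash
# Substitute the meta-variable and submit one job per dataset.
for dataset in myDataset1.csv myDataset2.csv myDataset3.csv; do
    sed "s/NAME_OF_THE_DATASET/${dataset}/g" batch_template > batch_submit_${dataset}
    bsub < "batch_submit_${dataset}"
done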
It might not be the simplest solution (one can probably get away with simpler numbering schemes using the facilities of bsub itself), but I found it to be very flexible, because it works equally well with multiple meta-variables and all kinds of flags and settings, and it also lets you insert different preprocessing scripts into the bsub template.
I have a number of scripts to run, some of which have one or more scripts that must be completed first. I've read a number of examples showing how bash's control operators work, but haven't found any good examples that address the complexity of the logic I'm trying to implement.
I have p_01.py and p_03.py, which are both requirements for p_09.py, but I also have individual scripts that only require p_01.py. For example:
((python p_01.py & python p_03.py) && python p_09.py) &
(python p_01.py &&
(
(python p_05.py;
python p_10.py) &
(python p_08.py;
python p_11.py)
)
)
wait $(jobs -p)
My question is: how can I run each script only after its requirements have completed, without repeating any script (such as p_01.py, which you'll notice is used twice above)? I'm looking for a generalized answer with some detail, since in actuality the dependencies are more numerous and nested than the example above. Thank you!
If you are thinking of the scripts in terms of their dependencies, that's difficult to translate directly to a master script. Consider using make, which would let you express these dependencies directly:
SCRIPTS = $(wildcard *.py)

.PHONY: all
all: $(SCRIPTS)

$(SCRIPTS):
	python $@

p_05.py p_08.py p_09.py: p_01.py
p_09.py: p_03.py
p_10.py: p_05.py
p_11.py: p_08.py
Running make -B -j4 would run all of the Python scripts with up to 4 executing in parallel at any one time.
I'm facing a problem in Python:
At a certain point, my script has to run some test scripts written in bash, in parallel, and wait until they end.
I've already tried:
os.system("./script.sh &")
inside a for loop, but it did not work.
Any suggestions?
Thank you!
Edit:
I have not explained my situation correctly:
My Python script resides in the home dir;
my .sh scripts reside in other dirs, for instance /tests/folder1 and /tests/folder2.
Using os.system requires calling os.chdir before os.system (to avoid "no such file or directory" errors, since my .sh scripts contain some relative references), and this approach also blocks my terminal output.
Using Popen and passing the full path from the home folder to my .sh scripts launches zombie processes that never respond.
Hope to find a solution,
Thank you guys!
Have you looked at subprocess? The convenience functions call and check_output block, but the default Popen object doesn't:
import subprocess

processes = []
processes.append(subprocess.Popen(['script.sh']))
processes.append(subprocess.Popen(['script2.sh']))
...
return_codes = [p.wait() for p in processes]
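Since your edit mentions that the .sh scripts live in /tests/folder1 and /tests/folder2 and rely on relative paths, note that Popen also accepts a cwd argument, which lets you avoid os.chdir altogether. A minimal sketch (the script names are illustrative):
import subprocess

# Launch each test script from its own directory so its relative references resolve.
processes = [
    subprocess.Popen(['./script.sh'], cwd='/tests/folder1'),
    subprocess.Popen(['./script.sh'], cwd='/tests/folder2'),
]
return_codes = [p.wait() for p in processes]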
Can you use GNU Parallel?
ls test_scripts*.sh | parallel
Or:
parallel ::: script1.sh script2.sh ... script100.sh
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new job as soon as one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel