I have a project where I have to regularly use a shell script that does some pre-processing of files. It has to be done this way per project requirements and legacy reasons - I've inherited a large portion of this code.
Once those files are processed, the output files are FURTHER processed by a Python script.
Is there any good way to run this in parallel? Right now, this is how my workflow looks.
Call shell script, processing thousands of files.
Once finished, call Python script, processing even more files.
Once finished, call SQL script to insert all of these files into a database.
If it's possible to parallelize either per file (one file through shell --> Python --> SQL) or per stage (parallel shell, then parallel Python, then parallel SQL), that'd be great. Everything I've read, though, seems to imply this is a logistical nightmare because of read/write conflicts. Is that true, and if not, any pointers in the right direction?
For the shell part, you can use xargs to run multiple processes in parallel.
Example:
echo dir1 dir2 dir3 | xargs -P 3 -I NAME tar czf NAME.tar.gz NAME
The key is -P, which tells xargs to run 3 parallel processes.
For the Python part, you can use ThreadPoolExecutor from concurrent.futures (see the sketch below).
For the SQL part, I can't say much without knowing which database you are using.
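A minimal sketch of that per-file idea for the original workflow, using ThreadPoolExecutor to run the existing scripts in parallel. The file list and the script names (preprocess.sh, postprocess.py) are placeholders for your own:

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder input files -- substitute the real list.
files = ["a.dat", "b.dat", "c.dat"]

def process(path):
    # Legacy shell pre-processing for one file, then the Python step on its output.
    subprocess.run(["./preprocess.sh", path], check=True)
    subprocess.run(["python", "postprocess.py", path], check=True)

# Threads are enough here: each task only waits on child processes,
# so the GIL is not a bottleneck. max_workers caps the parallelism.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(process, files))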
I have seven Python scripts that all manipulate some files in my folder system based on some information from an MSSQL server. The code is written so that each script should simply restart whenever it has finished. Additionally, the scripts should run in parallel. The order does not matter as long as they are all executed every now and again (that is, it would be bad if Script 1 (reading a file) ran endlessly while Script 7 (deleting all the files that have already been read) never ran; however, it wouldn't matter if Script 1 ran several times before Script 7 ran once).
So far, I've found a solution with PowerShell. I have 7 separate PowerShell Scripts (process1.ps1, process2.ps1, ..., process7.ps1) that all look as follows:
while($true)
{
    $i++
    Start-Process -NoNewWindow -Wait python F:\somewhere\something.py
    Start-Sleep -Seconds 10
}
This works if I open 7 different PowerShell consoles and start one .ps1 in each like this:
& "F:\PowerShellScripts\process1.ps1"
However, opening and monitoring seven sessions every time is cumbersome. Is there a way to start all these processes in one go but still ensure that they are parallelized correctly?
You can parallelize commands in PowerShell via:
Jobs
Runspaces
Workflows
For the easiest option, use jobs (my recommendation for your needs). For the best performance, use runspaces. I have not tried workflows yet.
A starter for jobs:
$scriptpaths = "C:\temp\1.py", "C:\temp\2.py"
foreach ($path in $scriptpaths){
    Start-Job -ScriptBlock {
        Param($path)
        while($true)
        {
            Start-Process -NoNewWindow -Wait python $path
            Start-Sleep -Seconds 10
        }
    } -ArgumentList $path
}
Please do read the documentation though; this code is far from ideal. Also, it does not synchronize your scripts: if one runs faster than the others, they will drift apart.
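If managing seven consoles is the main pain point, the same restart-loop idea could also be driven from a single Python supervisor instead of PowerShell; a rough sketch, with placeholder script paths:

import subprocess
import threading
import time

# Placeholder paths for the seven scripts.
script_paths = [r"F:\somewhere\script1.py", r"F:\somewhere\script2.py"]

def run_forever(path):
    while True:
        # Run the script, wait for it to finish, pause, then restart it.
        subprocess.run(["python", path])
        time.sleep(10)

for path in script_paths:
    threading.Thread(target=run_forever, args=(path,), daemon=True).start()

# Keep the supervisor process alive.
while True:
    time.sleep(60)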
I'm facing a problem in Python:
At a certain point, my script has to run some test scripts written in bash, in parallel, and wait until they all end.
I've already tried:
os.system("./script.sh &")
inside a for loop, but it did not work.
Any suggestions?
Thank you!
Edit
I have not explained my situation correctly:
My Python script resides in the home dir;
my .sh scripts reside in other dirs, for instance /tests/folder1 and /tests/folder2;
Using os.system requires calling os.chdir before os.system (to avoid "no such file or directory" errors, since my .sh scripts contain some relative references), and this method also blocks my terminal output.
Using Popen and passing the full path from the home folder to my .sh scripts leads to zombie processes with no response.
Hope to find a solution,
Thank you guys!
Have you looked at subprocess? The convenience functions call and check_output block, but the default Popen object doesn't:
import subprocess

processes = []
processes.append(subprocess.Popen(['./script.sh']))
processes.append(subprocess.Popen(['./script2.sh']))
...
# Wait for every script to finish and collect the exit codes.
return_codes = [p.wait() for p in processes]
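Given the edit about relative references, it may also help to pass Popen's cwd argument so each script runs from its own directory; a sketch, assuming there is one script.sh per test folder:

import subprocess

script_dirs = ["/tests/folder1", "/tests/folder2"]

processes = []
for d in script_dirs:
    # './script.sh' is resolved inside cwd=d, so the relative references
    # in the script keep working without os.chdir in the parent.
    processes.append(subprocess.Popen(["./script.sh"], cwd=d))

# Block until every script has finished.
return_codes = [p.wait() for p in processes]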
Can you use GNU Parallel?
ls test_scripts*.sh | parallel
Or:
parallel ::: script1.sh script2.sh ... script100.sh
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
I am new to Python and still at a basic level of learning. Recently I tried to write a script that generates new folders according to the number supplied in an input text file. After creating those folders, I want to copy a file into all of them at the same time. I can do it by typing
echo equil{1..x} | xargs -n 1 cp *.txt *
in the terminal, and it works fine. Here x is the number of folders in my working directory. But I want to make this automatic, i.e. call it from the script, so that the user doesn't need to type this line in the terminal every time. That is why I tried this
sub2 = subprocess.call(['echo', 'equil{1..x}', '|', 'xargs', '-n', '1', 'cp', '*.txt *'])
Can anyone please guide me and show me the mistake? I am not actually getting any error; rather, it is printing this
equil{1..x} | xargs -n 1 cp *.txt *
in the terminal after executing the rest of the script.
You have to use subprocess.Popen if you want to send data to/from stdin/stdout of your subprocesses. And you have to Popen a subprocess for each of the executables, i.e. in your example, one for echo and one for xargs.
There is an example in the docs: https://docs.python.org/2/library/subprocess.html#replacing-shell-pipeline
Another here: Call a shell command containing a 'pipe' from Python and capture STDOUT
However, instead of running echo to produce the lines, you can write them directly from Python to the process's stdin.
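For example, something along these lines (the folder count and the name of the copied file are placeholders):

import subprocess

x = 5  # placeholder: number of folders
names = "\n".join("equil{}".format(i) for i in range(1, x + 1))

# One process is enough: feed the folder names straight to xargs' stdin
# instead of spawning echo.
p = subprocess.Popen(["xargs", "-n", "1", "cp", "file_to_copy.txt"],
                     stdin=subprocess.PIPE)
p.communicate(input=names.encode())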
I don't think you can use subprocess.call() like this with pipes. For recipes on how to use pipes, see
https://docs.python.org/2/library/subprocess.html#replacing-shell-pipeline
That is, you would chain two Popen objects and call communicate() on the last one.
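A sketch of that recipe applied to the command above (again with placeholder names):

import subprocess

# First process stands in for `echo equil1 equil2 ...`.
echo = subprocess.Popen(["echo", "equil1", "equil2", "equil3"],
                        stdout=subprocess.PIPE)
# Second process reads the names from the pipe, one at a time.
xargs = subprocess.Popen(["xargs", "-n", "1", "cp", "file_to_copy.txt"],
                         stdin=echo.stdout)
echo.stdout.close()  # let echo receive SIGPIPE if xargs exits first
xargs.communicate()  # wait for the pipeline to finish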
I am using a Python script to perform some calculations on my images and save the resulting array into a .png file. I deal with 3000 to 4000 images. To process all of these I use a shell script on Ubuntu. It gets the job done, but is there any way to make it faster? I have 4 cores in my machine. How can I use all of them? The script I am using is below
#!/bin/bash
cd $1
for i in $(ls *.png)
do
python ../tempcalc12.py $i
done
cd ..
tempcalc12.py is my Python script.
This question might be trivial, but I am really new to programming.
Thank you
xargs has a --max-procs= (or -P) option which runs the jobs in parallel.
The following command runs at most 4 processes at a time.
ls *.png | xargs -n 1 -P 4 python ../tempcalc12.py
You can just add an & to the python line to have everything executed in parallel:
python ../tempcalc12.py $i &
This is a bad idea though, as having too many processes will just slow everything down.
What you can do is limit the number of parallel jobs, like this:
MAX_THREADS=4
for i in $(ls *.png); do
    python ../tempcalc12.py $i &
    while [ $( jobs | wc -l ) -ge "$MAX_THREADS" ]; do
        sleep 0.1
    done
done
Every 100 ms it checks the number of running jobs and, once the count drops below MAX_THREADS, starts the next one in the background.
This is a nice hack if you just want a quick working solution, but you might also want to investigate what GNU Parallel can do.
If you have GNU Parallel you can do:
parallel python ../tempcalc12.py ::: *.png
It will do The Right Thing by spawning one job per core, even if the names of your PNGs contain spaces, ', or " characters. It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed not to get half a line from two different jobs.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
What is the best way to run all Python files in a directory?
python *.py
only executes one file. Writing one line per file in a shell script (or makefile) seems cumbersome. I need this because I have a series of small matplotlib scripts, each creating a png file, and I want to create all of the images at once.
PS: I'm using the bash shell.
bash has loops:
for f in *.py; do python "$f"; done
An alternative is to use xargs. That allows you to parallelise execution, which is useful on today's multi-core processors.
ls *.py | xargs -n 1 -P 3 python
The -n 1 makes xargs give each process only one of the arguments, while the -P 3 will make xargs run up to three processes in parallel.