parallelize external program in python call?

I have an external program which I can not change.
It reads an input file, does some calculations, and writes out a result file. I need to run this for a million or so combinations of input parameters.
At the moment, I open a template file, change some strings in it (to insert the new parameters), write it out, start the program using os.popen(), read the output file, do a chi-square test on the result, and then restart with a different set of parameters.
The external program only runs on one core, so I tried splitting my parameter space up and starting multiple instances in different folders. Different folders were necessary because the program overwrites its output file. This works, but it still took about 24 hours to finish.
Is it possible to run this as separate processes without the result file being overwritten? Or do you see anything else I could do to speed this up?
Thx.
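
One possible way to automate the "one folder per instance" idea is sketched below. It assumes the program reads and writes files relative to its current working directory, and that the template uses str.format-style placeholders. The names run_one, run_all, input.txt and result.txt are hypothetical, not taken from the actual program.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_one(command, template, params,
            input_name="input.txt", output_name="result.txt"):
    """Run one instance of the external program in its own scratch directory."""
    with tempfile.TemporaryDirectory() as workdir:
        # Fill in the template and write the input file for this run only.
        Path(workdir, input_name).write_text(template.format(**params))
        # cwd=workdir makes the program write its result file here, so
        # concurrent runs cannot overwrite each other (assumes the program
        # uses paths relative to its working directory).
        subprocess.run(command, cwd=workdir, check=True)
        return params, Path(workdir, output_name).read_text()


def run_all(command, template, param_sets, workers=4):
    """Run the program for every parameter set, `workers` instances at a time."""
    # Threads are enough: each one only waits for its child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_one(command, template, p), param_sets))
```

Setting workers to roughly the number of cores keeps every core busy. The chi-square test could then run on each returned result text. If the program's own runtime dominates, this only helps as far as the number of available cores allows.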

Related

python process creating file with inflated size

I have a Python process which takes a file containing streamed data and converts it into a format ready to load into a database. I have just migrated this process from one Linux GCP VM to another running exactly the same code, but the final output file is nearly 4 times as big: 500 MB vs 2 GB.
When I download the files and manually inspect them, they look exactly the same to the eye.
Any ideas what could be causing this?
Edit: Thanks for the feedback. I traced it back to the input file, which is slightly different (as my stream recording process has also been migrated).
I am now trying to work out why a marginally different file creates such a different output file once it's been processed.

Simultaneous Python and C++ run with read and write files

So this one is a doozie, and a little too specific to find an answer online.
I am writing to a file in C++ and reading that file in Python at the same time to move a robot. Or trying to.
When I try running both programs at the same time, the C++ one runs first and then the Python one runs.
Here's the command I use:
./ColorFollow & python fileToHex.py
This happens even if I switch the order of commands.
Even if I run them in different terminals (which is the same thing, just covering all bases).
Both the Python and C++ code read / write in 'infinite' loops, so these two should run until I say stop.
The code works fine; when the Python script finally runs the robot moves as intended. It's just that the code doesn't run at the same time.
Is there a way to make this happen, or is this impossible?
If you need more information, lemme know, but the code is pretty much what you'd expect it to be.

If you are using Linux, & releases the bash session, so in this case ColorFollow and fileToHex.py will run as separate processes.
At the same time, the composition ./ColorFollow | python fileToHex.py looks interesting, because it redirects the stdout of ColorFollow to the stdin of fileToHex.py. That could synchronize the two programs: ColorFollow prints some code string upon exit, and fileToHex.py reads it and exits as well.
I would use a flag file such as /var/run/ColorFollow.flag and write 1 to it when one of the processes exits. Not a pipe, because we do not care which process starts first. If the next loop step of ColorFollow sees 1 in the file, it deletes the file and exits (this means that fileToHex.py has already exited). The same goes for fileToHex.py: check the flag file on each loop step and, if it exists, delete it and exit.
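
The Python side of the flag-file idea could look roughly like this. It is only a sketch: the helper names (should_stop, signal_exit, run_loop) are made up for illustration, the flag path is whatever both programs agree on, and the C++ program would need the equivalent check in its own loop.

```python
import os
import time


def should_stop(flag_path):
    """Return True, and remove the flag, if the other process signalled exit."""
    try:
        os.remove(flag_path)
        return True
    except FileNotFoundError:
        return False


def signal_exit(flag_path):
    """Leave a flag behind so the other process exits on its next loop step."""
    with open(flag_path, "w") as f:
        f.write("1")


def run_loop(flag_path, step, poll_interval=0.1):
    """Call step() repeatedly until the flag appears or the user interrupts."""
    try:
        while not should_stop(flag_path):
            step()
            time.sleep(poll_interval)
    except KeyboardInterrupt:
        # This process is stopping first, so tell the other one.
        signal_exit(flag_path)
```

Note that writing to /var/run usually needs root privileges, so a path in a directory both programs can write to may be more practical.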

Recommendation on how to write a good python wrapper LSF

I am creating a python wrapper script and was wondering what'd be a good way to create it.
I want to run code serially. For example:
Step 1.
Run same program (in parallel - the parallelization is easy because I work with an LSF system so I just submit three different jobs).
I run the program in parallel, and each run takes one fin.txt and outputs one fout.txt, i.e., when they all run they would produce 3 output files from the three input files, f1in.txt, f2in.txt, f3in.txt, f1out.txt, f2out.txt, f3out.txt.
(in the LSF system) When each run of the program is completed successfully, it produces a log file output, f1log.out, f2log.out, f3log.out.
The log files output are of this form, i.e., f1log.out would look something like this if it runs successfully.
------------------------------------------------------------
# LSBATCH: User input
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 86.20 sec.
Max Memory : 103 MB
Max Swap : 881 MB
Max Processes : 4
Max Threads : 5
The output (if any) is above this job summary.
Thus, I'd like my wrapper to check (every 5 min or so) for each run (1,2,3) if the log file has been created, and if it has been created I'd like the wrapper to check if it was successfully completed (aka, if the string Successfully completed appears in the log file).
Also, if one of the runs finishes and produces a log file that does not report successful completion, I'd like my wrapper to end and report that run k (k=1,2,3) was not completed.
After that,
Step2. If all three runs are successfully completed I would run another program that takes those three files as input... else I'd print an error.
Basically in my question I am looking for two things:
Does it sound like a good way to write a wrapper?
How can I check in Python, at regular intervals and in a clean way, whether a file exists and whether it contains a given pattern?
Note. I am aware that LSF has job dependencies but I find this way to be more clear and easy to work with, though may not be optimal.
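
The polling described above could be sketched as follows. The function names (check_log, wait_for_jobs) are hypothetical, and the sketch assumes a log file only appears once the job has ended; if LSF could write it incrementally, a half-written log might be misread as a failure.

```python
import time
from pathlib import Path

SUCCESS_MARKER = "Successfully completed"


def check_log(path):
    """Return None if the log does not exist yet, else True/False for success."""
    log = Path(path)
    if not log.exists():
        return None
    return SUCCESS_MARKER in log.read_text()


def wait_for_jobs(log_paths, poll_seconds=300):
    """Block until every log reports success; raise as soon as one run failed."""
    pending = dict(enumerate(log_paths, start=1))  # run number k -> log path
    while pending:
        for k, path in list(pending.items()):
            status = check_log(path)
            if status is None:
                continue  # log not written yet, check again next round
            if not status:
                raise RuntimeError(f"run {k} was not completed successfully ({path})")
            del pending[k]
        if pending:
            time.sleep(poll_seconds)
```

Step 2 would then be started after wait_for_jobs(["f1log.out", "f2log.out", "f3log.out"]) returns, with the RuntimeError caught to print the error message.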

I'm a user of an LSF system, and my major gripes are exit handling and cleanup. I think a neat idea would be to send a batch job array that has, for instance, an Initialization Task, a Legwork Task, and a Cleanup Task. LSF could complete all three and send a return code to the waiting head node. A lot of the time LSF works great for sending one job or command, but it isn't really set up to handle systematic processing.
Other than that I wish you luck :)

Is there a way to get (python) script output while running using org-mode

I'm writing small bits of code inside org-mode files. These bits of code are slow (they copy files from remote machines) and I wish to see how the copy progresses (sometimes the connection to the remote machine fails and I wish to know). For that, I want to print the serial number of the file currently being accessed.
Org-mode's code blocks have two problems with this:
It places either the printed messages or the returned variable in the results part of the block.
It does so only once the code ends.
Is there a way to get the printed output to a separated, live variable?

Reopening sys.stdout so that it is flushed should help.
See How to flush output of Python print?, and this blog post:
http://algorithmicallyrandom.blogspot.com.es/2009/10/python-tips-and-tricks-flushing-stdout.html
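
In Python 3, an alternative to reopening sys.stdout is to flush on every print, roughly as sketched below. The function name copy_with_progress and its copy_one callback are made up for illustration, and whether org-mode then shows the lines live still depends on how the block is evaluated.

```python
import sys


def copy_with_progress(items, copy_one, out=sys.stdout):
    """Copy each item, printing its serial number before the copy starts."""
    for serial, item in enumerate(items, start=1):
        # flush=True pushes the line out immediately instead of leaving it
        # in the buffer until the script ends.
        print(f"copying file {serial}: {item}", file=out, flush=True)
        copy_one(item)
```

Running the interpreter with python -u (unbuffered output) is another commonly used option.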

Python: Reading New Information Added To Massive Files

I'm working on a Python script to parse Squid (http://www.squid-cache.org/) log files. While the logs are rotated every day to stop them getting too big, they do reach between 40 and 90 MB by the end of each day.
Essentially what I'm doing is reading the file line by line, parsing out the data I need (IP, requested URL, time) and adding it to an sqlite database. However, this seems to be taking a very long time (it's been running for over 20 minutes now).
So obviously, re-reading the file can't be done. What I would like to do is read the file and then detect all new lines written. Or even better, at the start of the day the script will simply read the data in real time as it is added so there will never be any long processing times.
How would I go about doing this?

One way to achieve this is by emulating tail -f. The script would constantly monitor the file and process each new line as it appears.
For a discussion and some recipes, see tail -f in python with no time.sleep
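
A very small tail -f emulation might look like the sketch below. Unlike the recipes in the linked discussion, this one does use time.sleep between polls, and it does not handle log rotation. The name follow and its parameters are illustrative.

```python
import os
import time


def follow(path, poll_seconds=1.0, from_start=False):
    """Yield complete lines as they are appended to the file at `path`."""
    with open(path, "r") as f:
        if not from_start:
            f.seek(0, os.SEEK_END)  # skip what is already in the file
        buffer = ""
        while True:
            chunk = f.readline()
            if not chunk:
                time.sleep(poll_seconds)  # nothing new yet
                continue
            buffer += chunk
            if buffer.endswith("\n"):  # only hand out finished lines
                yield buffer
                buffer = ""
```

The parsing and the sqlite insert would go inside a loop such as for line in follow("access.log"), which never ends on its own.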

One way to do this is to use file system monitoring with pyinotify (http://pyinotify.sourceforge.net/) and set a callback function to be executed whenever the log file size changes.
Another way to do it, without requiring external modules, is to record somewhere persistent (possibly in your sqlite database itself) the offset of the end of the last line read from the log file, which you get with file.tell(), and to read only the newly added lines from that offset onwards, which is done with a simple call to file.seek(offset) before looping through the lines.
The main difference between keeping track of the offset and the "tail" emulation described in the other post is that this approach allows your script to be run multiple times, i.e. there is no need for it to be running continually, and it can recover after a crash.
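
The offset bookkeeping could be sketched like this. For simplicity the offset is kept in a small side file rather than in the sqlite database, and the name read_new_lines is made up. Treating a log that is shorter than the saved offset as rotated is an added assumption, not part of the answer above.

```python
import os


def read_new_lines(log_path, offset_path):
    """Return the complete lines appended to log_path since the previous call."""
    try:
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    except FileNotFoundError:
        offset = 0  # first run: start from the beginning
    if os.path.getsize(log_path) < offset:
        offset = 0  # file is shorter than before, assume it was rotated
    lines = []
    # Binary mode keeps tell()/seek() offsets as plain byte counts.
    with open(log_path, "rb") as log:
        log.seek(offset)
        for raw in iter(log.readline, b""):
            if not raw.endswith(b"\n"):
                break  # half-written last line, pick it up next time
            lines.append(raw.decode("utf-8", errors="replace"))
            offset = log.tell()
    with open(offset_path, "w") as f:
        f.write(str(offset))
    return lines
```

Each call returns only what was added since the previous one, so the script can be run from cron or restarted after a crash without re-reading the whole file.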
