Recommendation on how to write a good Python wrapper for LSF - python

I am creating a python wrapper script and was wondering what'd be a good way to create it.
I want to run code serially. For example:
Step 1.
Run the same program (in parallel; the parallelization is easy because I work with an LSF system, so I just submit three different jobs).
Each run takes one input file (fin.txt) and produces one output file (fout.txt); when all three run, they produce three output files, f1out.txt, f2out.txt, f3out.txt, from the three input files, f1in.txt, f2in.txt, f3in.txt.
In the LSF system, when each run of the program completes successfully, it produces a log file: f1log.out, f2log.out, f3log.out.
The log files are of this form, i.e., f1log.out would look something like this if the run succeeds:
------------------------------------------------------------
# LSBATCH: User input
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 86.20 sec.
Max Memory : 103 MB
Max Swap : 881 MB
Max Processes : 4
Max Threads : 5
The output (if any) is above this job summary.
Thus, I'd like my wrapper to check (every 5 min or so), for each run (1, 2, 3), whether the log file has been created, and if it has, to check whether the run was successfully completed (i.e., whether the string Successfully completed appears in the log file).
Also, if one of the runs finishes and produces a log file that does not show successful completion, I'd like my wrapper to exit and report which run (k=1,2,3) failed.
After that,
Step 2. If all three runs completed successfully, I would run another program that takes those three output files as input... else I'd print an error.
Basically in my question I am looking for two things:
Does it sound like a good way to write a wrapper?
How, in Python, can I check for the existence of a file and search it for a pattern at regular intervals, in a clean way?
Note: I am aware that LSF has job dependencies, but I find this approach clearer and easier to work with, though it may not be optimal.
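A minimal sketch of such a polling wrapper, assuming the file names and 5-minute interval described above (the follow-up program of Step 2 is left as a placeholder):

import os
import time

LOG_FILES = ["f1log.out", "f2log.out", "f3log.out"]
POLL_SECONDS = 300  # check every 5 minutes

def wait_for_runs():
    pending = set(LOG_FILES)
    while pending:
        for log in sorted(pending):
            if not os.path.exists(log):
                continue  # this run has not produced a log file yet
            with open(log) as fh:
                contents = fh.read()
            if "Successfully completed" in contents:
                pending.discard(log)
            else:
                raise RuntimeError("run %s was not completed successfully" % log)
        if pending:
            time.sleep(POLL_SECONDS)

wait_for_runs()
# Step 2: all three runs succeeded; launch the follow-up program here.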

I'm a user of an LSF system, and my major gripes are exit handling and cleanup. I think a neat idea would be to submit a batch job array that has, for instance: an initialization task, a legwork task, and a cleanup task. LSF could complete all three and send a return code to the waiting head node. A lot of the time LSF works great for sending one job or command, but it isn't really set up to handle systematic processing.
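As a rough illustration of that idea, something like the following could submit the chain from Python (the script names are placeholders; it assumes LSF's standard -J job-array and -w 'done(...)' dependency options):

import subprocess

def bsub(*args):
    # Submit one job via bsub and fail loudly if the submission is rejected.
    subprocess.run(["bsub"] + list(args), check=True)

# init.sh, work.sh and cleanup.sh are hypothetical placeholder scripts.
bsub("-J", "init", "sh", "init.sh")
bsub("-J", "work[1-3]", "-w", "done(init)", "sh", "work.sh")    # 3-element job array, runs after init
bsub("-J", "cleanup", "-w", "done(work)", "sh", "cleanup.sh")   # runs after the whole array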
Other than that I wish you luck :)

Related

Debug a Python program which seems paused for no reason

I am writing a Python program to analyze log files. So basically I have about 30000 medium-size log files and my Python script is designed to perform some simple (line-by-line) analysis of each log file. Roughly it takes less than 5 seconds to process one file.
So once I set up the processing, I just left it there, and when I came back after about 14 hours, my Python script had simply paused right after analyzing one log file; it seems it hadn't written the analysis output for that file to the file system, and that's it. No further progress.
I checked the memory usage and it seems fine (less than 1 GB). I also tried writing to the file system (a touch test), and that works as normal. So my question is: how should I proceed to debug the issue? Could anyone share some thoughts on that? I hope this is not too general. Thanks.
You may use the trace module (Trace or track Python statement execution) and/or pdb (The Python Debugger module).
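For the trace route, a minimal sketch (assuming your script's entry point is a function named main) would be:

import trace

# Print every line as it executes; very verbose, but it shows exactly
# where the script is when it appears to hang.
tracer = trace.Trace(trace=True, count=False)
tracer.run("main()")  # 'main' is an assumed entry point; adjust to your script

# Roughly equivalent from the command line: python -m trace --trace yourscript.py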
Try this tool https://github.com/khamidou/lptrace with command:
sudo python lptrace -p <process_id>
It will print every Python function your program invokes and may help you understand where your program is stuck or whether it is in an infinite loop.
If it does not output anything, that probably means your program is stuck, so try
pstack <process_id>
to check the stack trace and find out where it is stuck. The output of pstack is C frames, but you may still find something useful there to solve your problem.

parallelize external program in python call?

I have an external program which I can not change.
It reads an input file, does some calculations, and writes out a result file. I need to run this for a million or so combinations of input parameters.
The way I do it at the moment is that I open a template file, change some strings in it (to insert the new parameters), write it out, start the program using os.popen(), read the output file, do a chi-square test on the result, and then restart with a different set of parameters.
The external program only runs on one core, so I tried to split my parameter space up and started multiple instances in different folders. The different folders were necessary because the program overwrites its output file. This works, but it still took about 24 hours to finish.
Is it possible to run this as separate processes without the result file being overwritten? Or do you see anything else I could do to speed this up?
Thx.
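One way to avoid the overwriting and use all cores is to give every run its own working directory and drive the external program from a process pool. A minimal sketch along those lines (render_template, chi_square, parameter_sets, the program path and the file names are placeholders for what the question describes):

import os
import subprocess
import tempfile
from multiprocessing import Pool

def run_one(params):
    # Each run gets a private directory, so output files cannot clobber each other.
    workdir = tempfile.mkdtemp(prefix="run_")
    with open(os.path.join(workdir, "input.txt"), "w") as fh:
        fh.write(render_template(params))        # placeholder: fill in the template
    subprocess.run(["/path/to/external_program"], cwd=workdir, check=True)
    with open(os.path.join(workdir, "output.txt")) as fh:
        return params, chi_square(fh.read())     # placeholder: evaluate the result

if __name__ == "__main__":
    with Pool() as pool:                          # one worker per CPU core by default
        results = pool.map(run_one, parameter_sets)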

Link data from a running C code to a running Python code [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I am receiving data in my C code, and currently the only thing I can do is print the data on the screen. What I want to know is whether, instead of just printing the data on the screen, it is possible to send the data to a running Python program that can further process it. The data is in array form. The device that would otherwise have the data printed on the screen will be running the Python code.
I am doing this to avoid saving the data in a file and then accessing the file from my Python script, which would increase the delay for my time-dependent system.
The C code is running on a low-powered device that is just able to receive the data, without doing any processing.
(EDIT) A demonstration of what I mean has been included:
C side:
// code that receives the data periodically as an array
// when received do
printf(" x = %d y = %d z = %d ", data[0],data[1],data[2]);
// what I am looking for is something like this
sendtoPy("filename.py",data)
// repeat
Python side:
# this code just sits to wait for anything coming from C
# wait for any code to be received from datafromCexample.c
def waitforC("datafromCexample.c",data):
print "data has been received"
# process the data received into a list
# further processing
waitforC(data)
Of course this is just raw code for demonstration. If I made mistakes in my demonstration then please teach me and correct me; I simply want to learn. If this is not available, could you please suggest alternatives that can be used in this situation?
Thank you.
If the C program and the Python program/script can run on the same device, you have several options:
Let the C code output the data on standard output (this is probably what it's already doing), pipe it to the Python program, and let the Python program read it in on standard input; or
Let the C code output the data on standard output, call the C program from your Python code (e.g. with subprocess.check_output), and process the produced output; or
In the C code, have a function that returns the data chunk by chunk (one chunk per call, where a "chunk" is a quantity of a size you can easily produce and process). Use Cython (see Tony's answer) or CFFI or any of the other available ways to call it from Python repeatedly to obtain the data to process; or
Write a Python script that can be invoked with a single chunk of data and processes it. Call that Python script repeatedly for all chunks of data from your C code, using a system call to invoke it; or
Use Cython to invoke a Python function from C.
Depending on the operating system, you can get even fancier with named pipes, sockets, bus messages and whatnot, but the options above are probably already enough to choose from and should suffice for most use cases. Which one to choose depends on the nature of your data and of the processing you want to do on it, as well as on some non-functional requirements (e.g., what should happen if one of the processes stalls), so it cannot be answered in general.
Pipe
In comments to this answer, you inquire
I am looking to go with the first option […]. I would appreciate if you could recommend me a library or procedure that can work well with arrays.
and
For the C side, I will need the library that through it I will pipe the data to the python code. In the Python side, I suppose I will need to use multi-threading that will allow me to 1- run forever (to anticipate any incoming data) 2- receive with the incoming data from the C code, hence a library that will be able to do so
The beauty of pipes is that the involved processes don't need any special libraries in most programming languages. The processes just do something that almost every general purpose programming language allows in a (more or less) easy way: Reading from standard input and/or writing to standard output. That is also what command line applications do when they read what you type in on your keyboard and/or print text to the (today often virtual) terminal (a.k.a. a text console).
Forwarding one process' output to the other process' input (as if a user was reading what the first process prints and typing that in to the second process) is the operating system's task, so the processes don't have to be concerned with that. The first process just needs to output a format the second process can understand.
For example, the unix command ls will list the files and directories in the current working directory. Let's assume it always prints one name per line. The unix command shuf reads lines from standard input until it reaches end-of-file, then prints them in a random order. (It shuffles them.)
You can invoke shuf in a shell (e.g. sh or bash) and type in Hello World![Enter]I like shuffled text.[Enter]Do you, too?[Enter]No? What a pity.[Enter][Ctrl+d] and might get an output like:
Do you, too?
Hello World!
No? What a pity.
I like shuffled text.
If you want to list the files and directories in the current working directory, but in a different random order than returned by ls, you can pipe the output of ls through shuf. E.g. in sh or bash:
ls | shuf
would take the output of ls and forward it to shuf as-if it was typed in manually. (Only much quicker than you can type.)
Thus, if your C program produces machine-parseable output on the terminal (which might already be the case; you might want to include an example of its current terminal output in your question), you can forward that to your Python program with something like
./binary_of_your_c_program | python ./your_python_script.py
Your Python program then just has to read from standard input and process what it gets there however you want. Most straightforward approaches for reading standard input loop over all lines received until end-of-file occurs or the process receives a terminating signal, so keeping the Python script perpetually running shouldn't be too hard. (Unless you also care about having it automatically restarted when it unexpectedly terminates, e.g. due to a runtime error.)
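As a minimal sketch of the Python side (assuming the C program is changed to print one whitespace-separated x y z triple per line, ending with a newline):

import sys

for line in sys.stdin:           # blocks until the C program writes a line; ends at EOF
    fields = line.split()
    if len(fields) != 3:
        continue                 # ignore anything that isn't an x/y/z triple
    x, y, z = (int(f) for f in fields)
    # further processing of x, y, z goes here
    print("received:", x, y, z)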
You may consider invoking a C function from Python code; read this example: https://csl.name/post/c-functions-python/
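A common way to do that is the standard-library ctypes module; a minimal sketch (the library name libreceiver.so and the get_data function are assumptions, not from the question):

import ctypes

lib = ctypes.CDLL("./libreceiver.so")   # hypothetical shared library built from the C code
buf = (ctypes.c_int * 3)()              # C array of 3 ints, passed by reference
lib.get_data(buf)                       # hypothetical C function: void get_data(int *out)
x, y, z = buf[0], buf[1], buf[2]
print(x, y, z)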
If you care about the time efficiency of your system, I would use Cython to wrap the C part of the program that you have. You can read the full tutorial on that here. You may also use Cython to speed up the Python part of the code.

Persist Completed Pipeline in Luigi Visualiser

I'm starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, I've noticed that a few minutes after the last job (named MasterEnd) completes, all of the nodes disappear from the graph except for MasterEnd. This is a little inconvenient, as I'd like to see that everything is complete for the day/past days.
Further, if in the visualiser I go directly to the last job's URL, it can't find any history that it ran: Couldn't find task MasterEnd(date=2015-09-17, base_url=http://aws.east.com/, log_dir=/home/ubuntu/logs/). I have verified that it ran successfully this morning.
One thing to note is that I have a cron that runs this pipeline every 15 minutes to check for a file on S3. If it exists, it runs, otherwise it stops. I'm not sure if that is causing the removal of tasks from the visualiser or not. I've noticed it generates a new PID every run, but I couldn't find a way to persist one PID/day in the docs.
So, my questions: Is it possible to persist the completed graph for the current day in the visualiser? And is there a way to see what has happened in the past?
Appreciate all the help
I'm not 100% positive if this is correct, but this is what I would try first. When you call luigi.run, pass it --scheduler-remove-delay. I'm guessing this is how long the scheduler waits before forgetting a task after all of its dependents have completed. If you look through luigi's source, the default is 600 seconds. For example:
luigi.run(["--workers", "8", "--scheduler-remove-delay", "86400"], main_task_cls=task_name)
If you configure the remove_delay setting in your luigi.cfg then it will keep the tasks around for longer.
[scheduler]
record_task_history = True
state_path = /x/s/hadoop/luigi/var/luigi-state.pickle
remove_delay = 86400
Note, there is a typo in the documentation ("remove-delay" instead of "remove_delay"), which is being fixed under https://github.com/spotify/luigi/issues/2133

Automate Python Script

I'm running a Python script manually that fetches data in JSON format. How do I automate this script to run on an hourly basis?
I'm working on Windows 7. Can I use tools like Task Scheduler? If I can use it, what do I need to put in the batch file?
Can I use tools like Task Scheduler?
Yes. Any tool that can run arbitrary programs can run your Python script. Pick the one you like best.
If I can use it, what do I need to put in the batch file?
What batch file? Task Scheduler takes anything that can be run, with arguments—a C program, a .NET program, even a document with a default app associated with it. So, there's no reason you need a batch file. Use C:\Python33\python.exe (or whatever the appropriate path is) as your executable, and your script's path (and its arguments, if any) as the arguments. Just as you do when running the script from the command line.
See Using the Task Scheduler in MSDN for some simple examples, and Task Scheduler Schema Elements or Task Scheduler Scripting Objects for reference (depending on whether you want to create the schedule in XML, or via the scripting interface).
You want to create an ExecAction with Path set to "C:\Python33\python.exe" and Arguments set to "C:\MyStuff\myscript.py", and a RepetitionPattern with Interval set to "PT1H". You should be able to figure out the rest from there.
As sr2222 points out in the comments, often you end up scheduling tasks frequently, and needing to programmatically control their scheduling. If you need this, you can control Task Scheduler's scripting interface from Python, or build something on top of Task Scheduler, or use a different tool that's a bit easier to get at from Python and has more helpful examples online, etc.—but when you get to that point, take a step back and look at whether you're over-using OS task scheduling. (If you start adding delays or tweaking times to make sure the daily foo1.py job never runs until 5 minutes after the most recent hourly foo0.py has finished its job, you're over-using OS task scheduling—but it's not always that obvious.)
May I suggest WinAutomation or AutoMate. These two do the exact same thing, except the UI is a little different. I prefer WinAutomation, because the scripts are a little easier to build.
Yes, you can use the Task Scheduler to run the script on an hourly basis.
To execute a python script via a Batch File, use the following code:
start path_to_python_exe path_to_python_file
Example:
start C:\Users\harshgoyal\AppData\Local\Continuum\Anaconda3\python.exe %UserProfile%\Documents\test_script.py
If Python is on the Windows PATH environment variable, then you can reduce the syntax to:
start python %UserProfile%\Documents\test_script.py
What I generally do is run the batch file once via Task Scheduler and within the python script I call a thread/timer every hour.
class threading.Timer(interval, function, args=None, kwargs=None)
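A minimal sketch of that rescheduling pattern (fetch_json stands in for whatever the script actually does):

import threading

INTERVAL = 3600  # one hour, in seconds

def fetch_json():
    # ... do the actual work here ...
    # schedule the next run; the non-daemon timer thread keeps the process alive
    threading.Timer(INTERVAL, fetch_json).start()

fetch_json()  # first run happens immediately, then once an hour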
