Losing queued tasks in Python apply_async

I am writing a wrapper that repeatedly runs another program with different input files. The program (which I cannot modify, but need to use) must be run from the same directory as its input file(s). My approach so far is to use the os module to create the directory structure, then use apply_async to run the program in each sub-directory: each child process changes into its directory, creates the input file, and runs the simulation. The first 8 processes run successfully (I have 8 virtual cores).
However, I am queueing up to 100 of these tasks (each simulation takes a few minutes, which is what I'm trying to speed up). I use subprocess "call" on the external executable. Everything seemed to be going fine, but after the 8th simulation finishes, everything stops; when I check, 0 processes are running. It is as if the pool forgot about the remaining tasks.
What can I do to fix this? RAM usage only rises by about 300 MB out of 8 GB.
Do I need to implement some sort of queue myself that waits for the exit code of the simulation executable?
Thank you in advance.

Maybe better than nothing. This shows a correct way to use apply_async(), and demonstrates that - yup - there's no problem creating many more tasks than processes. I'd tell you how this differs from what you're doing, but I have no idea what you're doing ;-)
import multiprocessing as mp

def work(i):
    from time import sleep
    sleep(2.0 if i & 1 else 1.0)
    return i * i

if __name__ == "__main__":
    pool = mp.Pool(4)
    results = [pool.apply_async(work, (i,)) for i in range(100)]
    results = [r.get() for r in results]
    print len(results), results
    pool.close()
    pool.join()
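For the original problem (running an external simulation out of per-case directories), one possible shape for the worker is sketched below. This is not the asker's code: the executable name simulate, the input file name input.txt, and the case directory names are all placeholders, and it assumes the executable is on PATH. The key points are passing cwd= to subprocess.call so each child runs in its own directory without a process-wide os.chdir, and the fact that call blocks until the simulation exits and returns its exit code.

import os
import subprocess
import multiprocessing as mp

def run_case(workdir):
    # "simulate" and "input.txt" are hypothetical placeholder names.
    os.makedirs(workdir, exist_ok=True)
    with open(os.path.join(workdir, "input.txt"), "w") as f:
        f.write("...")  # write this case's input here
    # cwd= makes the child run inside workdir; call() blocks until the
    # simulation exits and returns its exit code.
    return subprocess.call(["simulate", "input.txt"], cwd=workdir)

if __name__ == "__main__":
    dirs = ["case_%03d" % i for i in range(100)]
    with mp.Pool(8) as pool:  # 8 workers servicing 100 queued tasks
        exit_codes = pool.map(run_case, dirs)
    print(exit_codes)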

Related

Efficient way of running multiple scripts simultaneously using Python

I have a CSV file with 10,000 rows; each row contains a link, and I want to download some information from each link. Since that's a time-consuming task, I manually split it into 4 Python scripts, each working on 2,500 rows, then open 4 terminals and run each script.
However, I wonder if there's a more efficient way of doing this. Right now I have 4 .py scripts that I launch by hand. What happens if I have to do the same with 1,000,000 rows? Should I manually create, say, 50 scripts and have each one download the info for its share of the rows? I hope I managed to explain myself :)
Thanks!
You don't need to do any manual splitting – set up a multiprocessing.Pool() with the number of workers you want to be processing your data, and have a function do your work for each item. A simplified example:
import multiprocessing

# This function is run in a separate process
def do_work(line):
    return f"{line} is {len(line)} characters long. This result brought to you by {multiprocessing.current_process().name}"

def main():
    work_items = [f"{2 ** i}" for i in range(1_000)]  # You'd read these from your file
    with multiprocessing.Pool(4) as pool:
        for result in pool.imap(do_work, work_items, chunksize=20):
            print(result)

if __name__ == "__main__":
    main()
This has (up to) 4 processes working on your data; for efficiency, each worker is handed 20 items at a time (the chunksize).
If you don't need the results to be in order, use the faster imap_unordered.
You can take a look at https://docs.python.org/3/library/asyncio-task.html to make the download + processing tasks async.
Another option is to use threads to run the downloads concurrently within a single interpreter (https://realpython.com/intro-to-python-threading); since downloading is mostly I/O-bound, threads work well here despite the GIL.
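A minimal sketch of that thread-based approach, using the standard library's concurrent.futures (not from the original answers; the fetch_info() helper and the links.csv file name are placeholders, and it assumes the link is in the first column):

import csv
from concurrent.futures import ThreadPoolExecutor

def fetch_info(url):
    # Hypothetical placeholder: download the page and extract what you need,
    # e.g. with urllib.request or the requests library.
    return url, "...info..."

def main():
    with open("links.csv", newline="") as f:  # assumed input file name
        urls = [row[0] for row in csv.reader(f)]
    # Threads suit I/O-bound work: most of the time is spent waiting on the network.
    with ThreadPoolExecutor(max_workers=16) as executor:
        for url, info in executor.map(fetch_info, urls):
            print(url, info)

if __name__ == "__main__":
    main()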

Is reusing same process name in loop situation possibly generate zombie process?

My script has to run for over a day, and its core cycle repeats 2-3 times per minute. I use multiprocessing to issue commands concurrently, and each process should be terminated/joined within one cycle.
In practice, though, the machine ends up running out of swap memory or freezing, which I suspect is caused by accumulated processes. In another session, while the program is running, I can see the number of Python PIDs growing abnormally over time, so I assume this is a process issue. What I don't understand is how it happens, since I made sure each cycle's process is finished before the next cycle proceeds.
My guess is that terminate()/join() actually needs more time to complete, so I should not "reuse" the same object name. Is this a reasonable guess, or is there another possibility?
import multiprocessing

def function(a, b):
    try:
        pass  # do stuff: audio / serial things
    except Exception:
        return

flag_for_2nd_cycle = 0
for i in range(1500):  # main loop, runs for a long time
    # do something
    if flag_for_2nd_cycle == 1:
        while my_process.is_alive():
            if (timecondition) < 30:  # kill the process if it is still alive
                my_process.terminate()
                my_process.join()
    flag_for_2nd_cycle = 1
    my_process = multiprocessing.Process(target=function, args=[c, d])
    my_process.start()
    # do something; other process jobs going on, for example
    my_process2 = multiprocessing.Process()  # *stuff
    my_process2.terminate()
    my_process2.join()
Based on your comment, you are controlling three projectors over serial ports.
The simplest way to do that would be to open three serial connections (using pySerial), then run a loop in which you check each connection for available data and, if there is any, read and process it. Then send commands to each of the projectors in turn.
Depending on the speed of the serial link you might not need more than this.
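Not part of the original answer, but a minimal sketch of that single-process polling loop using pySerial; the port names, baud rate, and the STATUS? command below are placeholders:

import time
import serial  # pySerial

# Hypothetical port names and baud rate; substitute your own.
PORTS = ["/dev/ttyUSB0", "/dev/ttyUSB1", "/dev/ttyUSB2"]
connections = [serial.Serial(port, 9600, timeout=0) for port in PORTS]

def handle(data, index):
    print("projector %d replied: %r" % (index, data))

while True:
    for index, conn in enumerate(connections):
        # Non-blocking check: read whatever bytes have arrived so far.
        if conn.in_waiting:
            handle(conn.read(conn.in_waiting), index)
        # Send the next command to this projector (placeholder command).
        conn.write(b"STATUS?\r\n")
    time.sleep(1.0)  # one polling cycle per second; no extra processes needed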

python multiprocessing pool.map hangs

I cannot make even the simplest examples of parallel processing with the multiprocessing package run in Python 2.7 (using Spyder as a UI on Windows), and I need help figuring out the issue. I have run conda update, so all of the packages should be up to date and compatible.
Even the first example in the multiprocessing documentation (given below) won't work: it spawns 4 new processes, but the console just hangs. I have tried everything I could find over the last 3 days, but none of the code that runs without hanging will use more than 25% of my computing power (I have a 4-core computer).
For now I have given up on running the procedure I actually designed and need parallel processing for; I am only trying to get a proof of concept so I can build from there. Can someone explain and point me in the right direction? Thanks
Example 1 from https://docs.python.org/2/library/multiprocessing.html

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool()
    print(p.map(f, [1, 2, 3]))
Example 2 (modified from original) from http://chriskiehl.com/article/parallelism-in-one-line/
from multiprocessing import Pool

def fn(i):
    return [i, i*i, i*i*i]

test = range(10)

if __name__ == '__main__':
    pool = Pool()
    results = map(fn, test)
    pool.close()
    pool.join()
I apologize if this has indeed been answered already; it seems as though I should be able to manage such a modest task, but I am not a programmer, and the resources I have found have been less than helpful given my very limited level of knowledge. Please let me know what further information is needed.
Thank you.
After installing Spyder on my virtual machine, this appears to be a Spyder-specific bug. Example 1 works in IDLE, when executed from the command line, and when executed from within Spyder (first saved to a file and then run), but not when executed line by line in Spyder.
I would suggest simply creating a new file in Spyder, adding the lines of code, saving it, and then running it.
For related reports see:
https://groups.google.com/forum/#!topic/spyderlib/LP5d8QZTXd0
QtConsole in Spyder cannot use multiprocessing.Manager
Multiprocessing working in Python but not in iPython
https://github.com/spyder-ide/spyder/issues/1900

Why does it look like multiple processes are being used when only 1 process is specified?

Please forgive me, as I'm new to using the multiprocessing library in Python and new to testing multi-process/multi-threaded projects.
In some legacy code, someone created a pool of processes to execute multiple processes in parallel. I'm trying to debug the code by making the pool only have 1 process but the output looks like it's still using multiple processes.
Below is some sanitized example code. Hopefully I included all the important elements to demo what I'm experiencing.
import sys
import multiprocessing

def myTestFunc():
    pool = multiprocessing.Pool(1)  # should only use 1 process
    for i in someListOfNames:
        pool.apply_async(method1, args=(listA,))

def method1(listA):
    for i in listA:
        print "this is the value of i: " + i
        sys.stdout.flush()
Since I expect there to be only 1 process in the pool, I shouldn't have any output collisions. But what I sometimes see in the log messages is this:
this is the value of i: Alpha
this is the value of i: Bravo
this is the this is the value of i: Mike # seems like 2 things trying to write at the same time
The overlapping writes seem to appear closer to the bottom of my debug log rather than the top, which suggests that the longer I run, the more likely the messages are to overwrite each other. I haven't tested with a shorter list yet, though.
I realize testing multi-process/multi-threaded programs is difficult, but in this case I think I've restricted it enough that it should be much easier than normal to test. I'm confused why this is happening, because:
I set the pool to have only 1 process
(I think) I force the process to flush its write buffer, so it should write without waiting/queuing and without running into this situation.
Thanks in advance for any help you can give me.
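No answer is recorded for this question here, but one way to check what is actually running: a small self-contained sketch (not the original code; the sample list is hypothetical) that tags every printed line with the worker's name via multiprocessing.current_process(), and collects the AsyncResult objects so exceptions inside the worker are not silently swallowed.

import multiprocessing

def method1(listA):
    # Tag every line with the name of the process that wrote it.
    name = multiprocessing.current_process().name
    for item in listA:
        print("[%s] this is the value of i: %s" % (name, item))

if __name__ == "__main__":
    listA = ["Alpha", "Bravo", "Mike"]
    pool = multiprocessing.Pool(1)  # a single worker process
    results = [pool.apply_async(method1, args=(listA,)) for _ in range(3)]
    pool.close()
    pool.join()
    for r in results:
        r.get()  # re-raises any exception that occurred inside the worker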

Multiprocessing launching too many instances of Python VM

I am writing some multiprocessing code (Python 2.6.4, WinXP) that spawns processes to run background tasks. In playing around with some trivial examples, I am running into an issue where my code just continuously spawns new processes, even though I only tell it to spawn a fixed number.
The program itself runs fine, but if I look in Windows TaskManager, I keep seeing new 'python.exe' processes appear. They just keep spawning more and more as the program runs (eventually starving my machine).
For example, I would expect the code below to launch 2 python.exe processes: the first being the program itself, and the second being the child process it spawns. Any idea what I am doing wrong?
import time
import multiprocessing

class Agent(multiprocessing.Process):

    def __init__(self, i):
        multiprocessing.Process.__init__(self)
        self.i = i

    def run(self):
        while True:
            print 'hello from %i' % self.i
            time.sleep(1)

agent = Agent(1)
agent.start()
It looks like you didn't carefully follow the guidelines in the documentation, specifically this section where it talks about "Safe importing of main module".
You need to protect your launch code with an if __name__ == '__main__': block or you'll get what you're getting, I believe.
I believe it comes down to the multiprocessing module not being able to use os.fork() as it does on Linux, where an already-running process is basically cloned in memory. On Windows (which has no such fork()) it must start a new Python interpreter, tell it to import your main module, and then execute the start/run method once that is done. If you have code at "module level", unprotected by the name check, then during that import it starts the whole sequence over again, ad infinitum.
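For reference, a guarded version of the original example would look something like this (same Agent class, unchanged; only the launch code moves under the name check, shown in the thread's Python 2 style):

import time
import multiprocessing

class Agent(multiprocessing.Process):

    def __init__(self, i):
        multiprocessing.Process.__init__(self)
        self.i = i

    def run(self):
        while True:
            print 'hello from %i' % self.i
            time.sleep(1)

if __name__ == '__main__':
    # On Windows, the child re-imports this module; without the guard, that
    # re-import would start another Agent, which starts another, and so on.
    agent = Agent(1)
    agent.start()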
When I run this on Linux with Python 2.6, I see a maximum of 4 python2.6 processes, and I can't guarantee that they're all from this program. They're definitely not filling up the machine.
Do you need a newer Python version? Or is this a Linux/Windows difference?
I don't see anything wrong with that. Works fine on Ubuntu 9.10 (Python 2.6.4).
Are you sure you don't have cron or something starting multiple copies of your script? Or that the spawned script is not calling anything that would start a new instance, for example as a side effect of import if your code runs directly on import?
