Multiprocessing never executing function keeps repeating code before function - python

I have a multiprocessing pool , that runs with 1 thread, and it keeps repeating the code before my function, i have tried with different threads, and also, i make things like this quite a bit, so i think i know what is causing the problem but i dont understand why, usually i use argparse to to parse files from the user, but i instead wanted to use input, no errors are thrown so i honestly have no clue.
from colorama import Fore
import colorama
import os
import ctypes
import multiprocessing
from multiprocessing import Pool
import random
colorama.init(autoreset=False)
print("headerhere")
#as you can see i used input instead of argparse
g = open(input(Fore.RED + " File Path?: " + Fore.RESET))
gg = open(input(Fore.RED + "File Path?: " + Fore.RESET))
#I messed around with this to see if it was the problem, ultimately disabling it until i fixed it, i just use 1 thread
threads = int(input(Fore.RED + "Amount of Threads?: " + Fore.RESET))
arrange = [lines.replace("\n", "")for lines in g]
good = [items.replace("\n", "") for items in gg]
#this is all of the code before the function that Pool calls
def che(line):
print("f")
#i would show my code but as i said this isnt the problem since ive made programs like this before, the only thing i changed is how i take file inputs from the user
def main():
pool = Pool(1)
pool.daemon = True
result = pool.map(che, arrange)
if __name__ == "__main__":
main()
if __name__ == "__main__":
main()

Here's a minimal, reproducible example of your issue:
from multiprocessing import Pool
print('header')
def func(n):
print(f'func {n}')
def main():
pool = Pool(3)
pool.map(func,[1,2,3])
if __name__ == '__main__':
main()
On OSes where "spawn" (Windows and MacOS) or "forkserver" (some Unix) are the default start methods, the sub-process imports your script. Since print('header') is at global scope, it will run the first time a script is imported into a process, so the output is:
header
header
header
header
func 1
func 2
func 3
A multiprocessing script should have everything meant to run once inside function(s), and they should be called once by the main script via if_name__ == '__main__':, so the solution is to move it into your def main()::
from multiprocessing import Pool
def func(n):
print(f'func {n}')
def main():
print('header')
pool = Pool(3)
pool.map(func,[1,2,3])
if __name__ == '__main__':
main()
Output:
header
func 1
func 2
func 3

If you want the top level code before the definition of che to only be executed in the master process, then place it in a function and call that function in main.
In multiprocessing, the top level statements will be interpreted/executed by both the master process and every child process. So, if some code should be executed only by the master and not by the children, then such code should not placed that at the top-level. Instead, such code should be placed in functions and these functions should be invoked in the main scope, i.e., in the scope of if block controlled by __main__ (or called in the main function in your code snippet).

Related

How to call a linux command line program in parallel with python

I have a command-line program which runs on single core. It takes an input file, does some calculations, and returns several files which I need to parse to store the produced output.
I have to call the program several times changing the input file. To speed up the things I was thinking parallelization would be useful.
Until now I have performed this task calling every run separately within a loop with the subprocess module.
I wrote a script which creates a new working folder on every run and than calls the execution of the program whose output is directed to that folder and returns some data which I need to store. My question is, how can I adapt the following code, found here, to execute my script always using the indicated amount of CPUs, and storing the output.
Note that each run has a unique running time.
Here the mentioned code:
import subprocess
import multiprocessing as mp
from tqdm import tqdm
NUMBER_OF_TASKS = 4
progress_bar = tqdm(total=NUMBER_OF_TASKS)
def work(sec_sleep):
command = ['python', 'worker.py', sec_sleep]
subprocess.call(command)
def update_progress_bar(_):
progress_bar.update()
if __name__ == '__main__':
pool = mp.Pool(NUMBER_OF_TASKS)
for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
pool.apply_async(work, (seconds,), callback=update_progress_bar)
pool.close()
pool.join()
I am not entirely clear what your issue is. I have some recommendations for improvement below, but you seem to claim on the page that you link to that everything works as expected and I don't see anything very wrong with the code as long as you are running on Linux.
Since the subprocess.call method is already creating a new process, you should just be using multithreading to invoke your worker function, work. But had you been using multiprocessing and your platform was one that used the spawn method to create new processes (such as Windows), then having the creation of the progress bar outside of the if __name__ = '__main__': block would have resulted in the creation of 4 additional progress bars that did nothing. Not good! So for portability it would have been best to move its creation to inside the if __name__ = '__main__': block.
import subprocess
from multiprocessing.pool import ThreadPool
from tqdm import tqdm
def work(sec_sleep):
command = ['python', 'worker.py', sec_sleep]
subprocess.call(command)
def update_progress_bar(_):
progress_bar.update()
if __name__ == '__main__':
NUMBER_OF_TASKS = 4
progress_bar = tqdm(total=NUMBER_OF_TASKS)
pool = ThreadPool(NUMBER_OF_TASKS)
for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
pool.apply_async(work, (seconds,), callback=update_progress_bar)
pool.close()
pool.join()
Note: If your worker.py program prints to the console, it will mess up the progress bar (the progress bar will be re-written repeatedly on multiple lines).
Have you considered instead importing worker.py (some refactoring of that code might be necessary) instead of invoking a new Python interpreter to execute it (in this case you would want to be explicitly using multiprocessing). On Windows this might not save you anything since a new Python interpreter would be executed for each new process anyway, but this could save you on Linux:
import subprocess
from multiprocessing.pool import Pool
from worker import do_work
from tqdm import tqdm
def update_progress_bar(_):
progress_bar.update()
if __name__ == '__main__':
NUMBER_OF_TASKS = 4
progress_bar = tqdm(total=NUMBER_OF_TASKS)
pool = Pool(NUMBER_OF_TASKS)
for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
pool.apply_async(do_work, (seconds,), callback=update_progress_bar)
pool.close()
pool.join()

Python running subprocesses without waiting whilst still receiving return codes

I have found relating questions to mine but cannot find one that solves my problem.
The problem
I am building a program that monitors several directories, then spawns a subprocess based on directory or particular filename.
These subprocesses can often take up to several hours (for example if rendering 000's of PDFs) to complete. Because of this, I would like to know the best way for the program to continue monitoring the folders in parallel to the subprocess that is still running, and be able to spawn additional subprocesses, as long as they are of a different type to the subprocess currently running.
Once the subprocess has completed, the program should be able to receive a return code, that subprocess would be available to run again.
Code as it stands
This is the simple code that runs the program currently, calling functions when a file is found:
while 1:
paths_to_watch = ['/dir1','/dir2','/dir3','/dir4']
after = {}
for x in paths_to_watch:
key = x
after.update({key :[f for f in os.listdir(x)]})
for key, files in after.items():
if(key == '/dir1'):
function1(files)
elif(key == '/dir2'):
function2(files)
elif(key == '/dir3'):
function3(files)
elif(key == '/dir4'):
function3(files)
time.sleep(10)
Of course this means that the program waits for the process to be finished before it continues to check for files in paths_to_watch
From other questions, it looks like this is something that could be handled with process pools, however my lack of knowledge in this area means I do not know where to start.
I am assuming that you can use threads rather than processes, an assumption that will hold up if your functions function1 thorugh function4 are predominately I/O bound. Otherwise you should substitute ProcessPoolExecutor for ThreadPoolExecutor in the code below. Right now your program loops indefinitely, so the threads too will never terminate. I am also assuming that that functions function1 through function4 have unique implementations.
import os
import time
from concurrent.futures import ThreadPoolExecutor
def function1(files):
pass
def function2(files):
pass
def function3(files):
pass
def function4(files):
pass
def process_path(path, function):
while True:
files = os.listdir(path)
function(files)
time.sleep(10)
def main():
paths_to_watch = ['/dir1','/dir2','/dir3','/dir4']
functions = [function1, function2, function3, function4]
with ThreadPoolExecutor(max_workers=len(paths_to_watch)) as executor:
results = executor.map(process_path, paths_to_watch, functions)
for result in results:
# threads never return so we never get a result
print(result)
if __name__ == '__main__':
main()

Multiprocessing Pool initializer fails pickling

I am trying to use the multiprocessing.Pool to implement a multithread application. To share some variables I am using a Queue as hinted here:
def get_prediction(data):
#here the real calculation will be performed
....
def mainFunction():
def get_prediction_init(q):
print("a")
get_prediction.q = q
queue = Queue()
pool = Pool(processes=16, initializer=get_prediction_init, initargs=[queue,])
if __name__== '__main__':
mainFunction()
This code is running perfectly on a Debian machine, but is not working at all on another Windows 10 device. It fails with the error
AttributeError: Can't pickle local object 'mainFunction.<locals>.get_prediction_init'
I do not really know what exactly is causing the error. How can I solve the problem so that I can run the code on the Windows device as well?
EDIT: The problem is solved if I create the get_predediction_init function on the same level as the mainFunction. It has only failed when I defined it as an inner function. Sorry for the confusion in my post.
The problem is in something you haven't shown us. For example, it's a mystery where "mainFunction" came from in the AttributeError message you showed.
Here's a complete, executable program based on the fragment you posted. Worked fine for me under Windows 10 just now, under Python 3.6.1 (I'm guessing you're using Python 3 from your print syntax), printing "a" 16 times:
import multiprocessing as mp
def get_prediction(data):
#here the real calculation will be performed
pass
def get_prediction_init(q):
print("a")
get_prediction.q = q
if __name__ == "__main__":
queue = mp.Queue()
pool = mp.Pool(processes=16, initializer=get_prediction_init, initargs=[queue,])
pool.close()
pool.join()
Edit
And, based on your edit, this program also works fine for me:
import multiprocessing as mp
def get_prediction(data):
#here the real calculation will be performed
pass
def get_prediction_init(q):
print("a")
get_prediction.q = q
def mainFunction():
queue = mp.Queue()
pool = mp.Pool(processes=16, initializer=get_prediction_init, initargs=[queue,])
pool.close()
pool.join()
if __name__ == "__main__":
mainFunction()
Edit 2
And now you've moved the definition of get_prediction_init() into the body of mainFunction. Now I can see your error :-)
As shown, define the function at module level instead. Trying to pickle local function objects can be a nightmare. Perhaps someone wants to fight with that, but not me ;-)

Python Multiprocessing Causing Entire Script to Loop

It seems multiprocessing swaps between threads faster so I started working on swapping over but I'm getting some unexpected results. It causes my entire script to loop several times when a thread didn't before.
Snippet example:
stuff_needs_done = true
more_stuff_needs_done = true
print "Doing stuff"
def ThreadStuff():
while 1 == 1:
#do stuff here
def OtherThreadStuff():
while 1 == 1:
#do other stuff here
if stuff_needs_done == true:
Thread(target=ThreadStuff).start()
if more_stuff_needs_done == true:
Thread(target=OtherThreadStuff).start()
This works as I'd expect. The threads start and run until stopped. But when running a lot of these the overhead is higher (so I'm told) so I tried swapping to multiprocessing.
Snippet example:
stuff_needs_done = true
more_stuff_needs_done = true
print "Doing stuff"
def ThreadStuff():
while 1 == 1:
#do stuff here
def OtherThreadStuff():
while 1 == 1:
#do other stuff here
if stuff_needs_done == true:
stuffproc1= Process(target=ThreadStuff).start()
if more_stuff_needs_done == true:
stuffproc1= Process(target=OtherThreadStuff).start()
But what seems to happen is the whole thing starts a couple of times so the "Doing stuff" output comes up and a couple of the threads run.
I could put some .join()s in but there is no loop which should cause the print output to run again which means there is nowhere for it to wait.
My hope is this is just a syntax thing but I'm stumped trying to find out why the whole script loops. I'd really appreciate any pointers in the right direction.
This is mentioned in the docs:
Safe importing of main module
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such a
starting a new process).
For example, under Windows running the following module would fail with a RuntimeError:
from multiprocessing import Process
def foo():
print 'hello'
p = Process(target=foo)
p.start()
Instead one should protect the “entry point” of the program by using if __name__ == '__main__': as follows:
from multiprocessing import Process, freeze_support
def foo():
print 'hello'
if __name__ == '__main__':
freeze_support()
p = Process(target=foo)
p.start()
This allows the newly spawned Python interpreter to safely import the module and then run the module’s foo() function.

Windows multiprocessing

As I have discovered windows is a bit of a pig when it comes to multiprocessing and I have a questions about it.
The pydoc states you should protect the entry point of a windows application when using multiprocessing.
Does this mean only the code which creates the new process?
For example
Script 1
import multiprocessing
def somemethod():
while True:
print 'do stuff'
# this will need protecting
p = multiprocessing.Process(target=somemethod).start()
# this wont
if __name__ == '__main__':
p = multiprocessing.Process(target=somemethod).start()
In this script you need to wrap this in if main because the line in spawning the process.
But what about if you had?
Script 2
file1.py
import file2
if __name__ == '__main__':
p = Aclass().start()
file2.py
import multiprocessing
ITEM = 0
def method1():
print 'method1'
method1()
class Aclass(multiprocessing.Process):
def __init__(self):
print 'Aclass'
super(Aclass, self).__init__()
def run(self):
print 'stuff'
What would need to be protected in this instance?
What would happen if there was a if __main__ in File 2, would the code inside of this get executed if a process was being created?
NOTE: I know the code will not compile. It's just an example.
The pydoc states you should protect the entry point of a windows application when using multiprocessing.
My interpretation differs: the documentations states
the main module can be safely imported by a new Python interpreter without causing unintended side effects (such a starting a new process).
So importing your module (import mymodule) should not create new processes. That is, you can avoid starting processes by protecting your process-creating code with an
if __name__ == '__main__':
...
because the code in the ... will only run when your program is run as main program, that is, when you do
python mymodule.py
or when you run it as an executable, but not when you import the file.
So, to answer your question about the file2: no, you do not need protection because no process is started during the import file2.
Also, if you put an if __name__ == '__main__' in file2.py, it would not run because file2 is imported, not executed as main program.
edit: here is an example of what can happen when you do not protect your process-creating code: it might just loop and create a ton of processes.

Categories