Using Windows and Python 3.6, I am having trouble with multiprocessing once I import the function that is to be mapped.
This code works (gives output [2, 12, 30, 56, 90]):
from multiprocessing import Pool

def f(x,y):
    return x*y

if __name__ == '__main__':
    args=[(1,2),(3,4),(5,6),(7,8),(9,10)]
    with Pool(4) as p:
        result=p.starmap(f,args)
    print(result)
Now I move the function f to a different .py file called test.py and import it instead:
from multiprocessing import Pool
from test import f

if __name__ == '__main__':
    args=[(1,2),(3,4),(5,6),(7,8),(9,10)]
    with Pool(4) as p:
        result=p.starmap(f,args)
    print(result)
with test.py only containing:
def f(x,y):
    return x*y
Running this leads to what looks like an infinite loop (it never returns anything and CPU usage stays high).
What is causing this, and is there a way to fix it? I have successfully got multiprocessing to work on a program by copying all the code into one huge .py file, which obviously is not ideal.
I have the following code but cannot get the results out of the iterator:
from multiprocess import freeze_support
from pathos.multiprocessing import ProcessPool

if __name__ == "__main__":
    freeze_support()
    pool = ProcessPool(nodes=4)
    results = pool.uimap(pow, [1,2,3,4], [5,6,7,8])
    print("...")
    print(list(results))
The code does not error; it just hangs.
There are a couple of subtleties to getting this to work, but the short version is that imap and uimap return iterators, unlike map in the standard multiprocessing example. To extract the results, iterate over the returned iterator, for example in a for loop. If the function lives inside a class, the called method also needs to be a @staticmethod.
from multiprocessing import freeze_support
from multiprocessing import Pool

def f(vars):
    return vars[0]**vars[1]

if __name__ == "__main__":
    freeze_support()
    pool = Pool(4)
    for run in pool.imap(f, [(1,5), (2,8), (3,9)]):
        print(run)
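For the class case mentioned above, a minimal sketch (the Worker class is just an illustrative name, not code from the question; the key point is that the pool is handed Worker.f, a @staticmethod, which can be pickled by its qualified name):

from multiprocessing import Pool, freeze_support

class Worker:
    # @staticmethod means Worker.f is an ordinary function that the pool
    # can pickle by its qualified name and resolve in the worker processes.
    @staticmethod
    def f(vars):
        return vars[0] ** vars[1]

if __name__ == "__main__":
    freeze_support()
    with Pool(4) as pool:
        # imap returns an iterator, so consume it in a loop (or via list()).
        for run in pool.imap(Worker.f, [(1, 5), (2, 8), (3, 9)]):
            print(run)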
I'm trying to implement very simple multiprocessing code in Python 2.7, but it looks like the code runs serially rather than in parallel.
The following code prints *****1*****, while I expect it to print *****2***** immediately after *****1*****.
import os
import multiprocessing
from time import sleep

def main():
    func1_proc = multiprocessing.Process(target=func1())
    func2_proc = multiprocessing.Process(target=func2())
    func1_proc.start()
    func2_proc.start()
    pass

def func1():
    print "*****1*****"
    sleep(100)

def func2():
    print "*****2*****"
    sleep(100)

if __name__ == "__main__":
    main()
You're calling func1 and func2 yourself before passing their return values to Process, so func1 sleeps for 100 seconds in the parent and returns None before the Process objects are even created; the child processes are then given None as their target and do nothing.
You should pass function objects to Process instead so that it will run them in separate processes:
func1_proc = multiprocessing.Process(target=func1)
func2_proc = multiprocessing.Process(target=func2)
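Putting it together, a sketch of the corrected script (join() calls added so the parent waits for both children; the prints are written so they run on both Python 2 and 3):

import multiprocessing
from time import sleep

def func1():
    print("*****1*****")
    sleep(100)

def func2():
    print("*****2*****")
    sleep(100)

def main():
    # Pass the function objects themselves; each child process calls its target.
    func1_proc = multiprocessing.Process(target=func1)
    func2_proc = multiprocessing.Process(target=func2)
    func1_proc.start()
    func2_proc.start()
    func1_proc.join()
    func2_proc.join()

if __name__ == "__main__":
    main()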
I'm having great difficulty getting functions executed in a multiprocessing Pool when they are loaded through
dill.load('somefile.sav','rb')
The code is something like this:
import dill as dill
import multiprocessing_on_dill as mp

dill_func = dill.load('somefile.sav','rb')

def some_mp_func(x):
    dill_func(x)

if (__name__ == '__main__'):
    __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
    x="test"
    pool = mp.Pool(processes = (8))
    pool.map(some_mp_func, x)
    pool.close()
    pool.join()
dill_func is an sklearn Pipeline.
The output is:
NameError: name 'Y' is not defined
Here 'Y' is a function within dill_func, part of a class inside dill_func.
Running some_mp_func(x) without multiprocessing works perfectly fine, with no NameError. Any suggestions?
SOLUTION: when dumping, use this setting:
dill.settings['recurse']=True
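For completeness, a minimal sketch of the dump side with that setting (Y and my_func here are stand-ins for the real pipeline internals, not code from the question; the actual object is an sklearn Pipeline):

import dill

# With recurse=True, dill also traces and serializes the globals that the
# pickled object refers to (such as a helper like 'Y'), so loading it in a
# fresh process does not raise NameError.
dill.settings['recurse'] = True

def Y(x):            # stand-in for the helper function inside the real class
    return x * 2

def my_func(x):      # stand-in for the real dill_func
    return Y(x)

with open('somefile.sav', 'wb') as fh:
    dill.dump(my_func, fh)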
I have what should be an "embarrassingly parallel" task: I'm trying to parse a number of log files in a CPU-heavy manner. I don't care about the order they're done in, and the processes don't need to share any resources or threads.
I'm on a Windows machine.
My setup is something like:
main.py
import parse_file
import multiprocessing
...

files_list = [r'c:\file1.log', r'c:\file2.log']

if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    for this_file in files_list:
        r = pool.apply_async(parse_file.parse, (this_file, parser_config))
    results = r.get()
    ...
    # Code to do stuff with the results
parse_file is basically an entirely self-contained module that doesn't access any shared resources - the results are returned as a list.
This all runs absolutely fine when I run it without multiprocessing, but as soon as I enable it, I get a huge wall of errors indicating that the source module (the one the code above lives in) is itself being run in parallel. (The error is a database-locking error for something that exists only in the source script, not the parse_file module, and at a point before the multiprocessing code!)
I don't pretend to understand the multiprocessing module, and I worked from other examples here, but none of them includes anything that indicates this is normal or explains why it's happening.
What am I doing wrong? How do I multi-process this task?
Thanks!
This is easily reproducible with the following:
test.py
import multiprocessing
import test_victim

files_list = [r'c:\file1.log', r'c:\file2.log']

print("Hello World")

if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    results = []
    for this_file in files_list:
        r = pool.map_async(test_victim.calculate, range(10), callback=results.append)
    results = r.get()
    print(results)
test_victim.py:
def calculate(value):
    return value * 10
The output when you run test.py should be:
Hello World
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
But in reality it is:
Hello World
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
Hello World
Hello World
(The actual number of extra "Hello World"s changes every time I run it, between 1 and 4; there should be none.)
On Windows, when Python executes
pool = multiprocessing.Pool(None)
new Python processes are spawned. Because Windows does not have os.fork, these new Python processes re-import the calling module. Thus, anything not inside
if __name__ == '__main__':
gets executed once for each process spawned. That is why you are seeing multiple Hello Worlds.
Be sure to read the "Safe importing of main module" warning in the docs.
So to fix it, put all the code that needs to run only once inside the
if __name__ == '__main__':
statement.
For example, your runnable example would be fixed by placing
print("Hello World")
inside the if __name__ == '__main__' statement:
import multiprocessing
import test_victim

files_list = [r'c:\file1.log', r'c:\file2.log']

def main():
    print("Hello World")
    pool = multiprocessing.Pool(None)
    results = []
    for this_file in files_list:
        r = pool.map_async(test_victim.calculate, range(10), callback=results.append)
    results = r.get()
    print(results)

if __name__ == '__main__':
    main()
yields
Hello World
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
Especially on Windows, scripts that use multiprocessing must be both runnable (as a script) and importable. An easy way to make a script importable is to structure it as shown above: place everything that the script should execute inside a function called main, and then just use
if __name__ == '__main__':
main()
at the end of the script. The stuff before main should just be import statements and the definition of global constants.
Using Windows 7 and Python 2.6, I am trying to run a simulation model in parallel. I can launch multiple instances of the executable by double-clicking on them in my file browser. However, asynchronous calls with Popen result in each successive instance interrupting the previous one. For what it's worth, the executable returns text to the console, but I don't need to collect results interactively.
Here's where I am so far:
import os, multiprocessing, subprocess

def run(c):
    exe = os.path.join("<location>","folder",str(c),"program.exe")
    run = os.path.join("<location>","folder",str(c),"run.dat")
    subprocess.Popen([exe,run], creationflags=subprocess.CREATE_NEW_CONSOLE)

def main():
    pool = multiprocessing.Pool(3)
    for c in range(10):
        pool.apply_async(run,(str(c),))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
After scouring SO for a solution, I've learned that using multiprocessing may be redundant, but I need some way to limit the number of cores working.
Solved, enabled by @J.F. Sebastian's comment regarding the cwd argument:
import os, multiprocessing, subprocess

def run(c):
    exe = os.path.join("<location>","folder",str(c),"program.exe")
    run = os.path.join("<location>","folder",str(c),"run.dat")
    subprocess.check_call([exe,run], cwd=os.path.join("<location>","folder"), creationflags=subprocess.CREATE_NEW_CONSOLE)

def main():
    pool = multiprocessing.Pool(3)
    for c in range(10):
        pool.apply_async(run,(str(c),))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
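As an aside, because each worker just blocks while the external executable runs, a pool of threads is also enough to cap how many instances run at once. A sketch (not part of the original answer) using multiprocessing.dummy, which provides a thread-backed Pool with the same interface:

import os
import subprocess
import multiprocessing.dummy

def run(c):
    exe = os.path.join("<location>", "folder", str(c), "program.exe")
    dat = os.path.join("<location>", "folder", str(c), "run.dat")
    # check_call blocks until this instance exits, so the pool size
    # limits how many copies of the executable run simultaneously.
    subprocess.check_call([exe, dat], cwd=os.path.join("<location>", "folder"),
                          creationflags=subprocess.CREATE_NEW_CONSOLE)

def main():
    pool = multiprocessing.dummy.Pool(3)   # 3 worker threads, not processes
    pool.map(run, range(10))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()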