python multiprocessing behaviour

I've noticed strange behaviour when running some python code that made use of the multiprocessing library. This is all under Windows and likely a Windows thing, but maybe someone could explain what's happening.
If I create a simple python script and create a pool like so:
import multiprocessing
pool = multiprocessing.Pool()
print "made a pool"
while True:
    pass
When I run the script I see "made a pool" printed 8 times, which would be the default number of processes created by Pool() as I have 8 cores on my machine.
When I change the script to be like so:
import multiprocessing

def run():
    pool = multiprocessing.Pool()
    print "made a pool"
    while True:
        pass

if __name__ == '__main__':
    run()
I see "made a pool" printed once - which is what I would have expected in both cases.
I guess I would normally run any code using the multiprocessing library from a function, but got caught out by this while playing with some code in a single python file. Anyone know why it happens?

Related

Python Multiprocessing Looping Python File Instead of Starting Process

I'm trying to get started with multiprocessing, and I'm running into some interesting issues. The code I'm using is below (for the record, this example is straight from the multiprocessing documentation):
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob'))
    p.start()
    p.join()
This works fine, and prints "hello bob" as it should. When I add any additional code to the file though, before or after the if statement, then p does not evaluate, and my file loops back to the beginning and runs all over again endlessly. For example, the following code gives me this issue:
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob'))
    p.start()
    p.join()

test_input = input("test input")
I am running Python on Windows 10, with PyCharm 2021.3.2 and Python 3.10.0. Is this an issue any of you have seen before? At this point I'm starting to wonder if it's an issue between Windows and PyCharm, or Windows and Python, or maybe just a case of inexperience on my part.
Thank you!
That if __name__ == '__main__': guard is important. On systems that don't use fork, it simulates a fork by importing the main script in each worker process without naming it __main__ (it's named __mp_main__ IIRC). Any code that should only run in the "main" script needs to be protected by that guard (it can be indirectly, by defining a function and calling it within the guarded segment; the function will be defined in the workers, but not run).
So to fix this, all you need to do is indent the test_input = input("test input") so it's protected by the if __name__ == '__main__': guard. In real code, I try to keep the guarded section clean (so I can't accidentally write functions that rely on global state that doesn't exist when it's not run as the main script, and for the mild performance benefits of using function locals over globals), so I'd write it like:
from multiprocessing import Process

def f(name):
    print('hello', name)

def main():
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
    test_input = input("test input")

if __name__ == '__main__':
    main()
but that's not strictly necessary.
I thought I would elaborate on ShadowRanger's answer:
On Windows systems new subprocesses are created by the following steps:
1. A new process is created wherein the Python interpreter is re-launched.
2. The Python interpreter re-interprets the current source program, executing everything that is at global scope in order to compile function definitions, initialize global variables, etc.
3. Finally, your worker function, f in this case, is invoked with memory thus initialized.
The reason for placing the code that creates the subprocess within a block governed by if __name__ == '__main__': is that if you didn't, then because of Step 2 above you would get into a recursive, infinite loop creating new subprocesses ad infinitum. The key point is that only in the main process will the variable __name__ have the value '__main__'; it will have a different value in any subprocess that is created. And so the code that creates the new subprocess, i.e. p = Process(target=f, args=('bob',)), will not be executed as part of the initialization of the subprocess.
Your problem arises from the statement test_input = input("test input") being at global scope and not within an if __name__ == '__main__': block, so it is executed as part of the initialization of the subprocess. Your worker function, f, will not run until this prompt for input is satisfied, and when it returns your main process will put out the prompt again. Anyway, this is what I see when the program is run from a Windows command prompt. Perhaps with PyCharm there is a restriction against doing the input statement from any thread other than the main thread. But even if an exception is being thrown from that statement in creating the subprocess, I still don't quite see how your program would be looping continuously. Unfortunately, I do not have PyCharm installed.
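To make the __name__ distinction concrete, here is a small sketch of my own (not from the question) that prints the module name seen by the parent and by a spawned worker; on Windows the re-imported script is typically named __mp_main__ rather than __main__:
from multiprocessing import Process

def show_name():
    # In the spawned worker the script has been re-imported, so its
    # __name__ is not '__main__' (CPython names it '__mp_main__').
    print("worker sees __name__ =", __name__)

if __name__ == '__main__':
    print("parent sees __name__ =", __name__)
    p = Process(target=show_name)
    p.start()
    p.join()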
Regarding ShadowRanger's answer, I think you should also put a comma after 'bob'.
According to https://docs.python.org/3/library/multiprocessing.html,
p should look like this if you want to pass a single argument:
p = Process(target=f, args=('bob',))

Python multiprocessing: Why does using Process run my program from the start?

I was having some trouble figuring out why my console would always print the print statements I had at the start of my file. Here's what it looks like:
from multiprocessing import Process
import time

print('hello') # why does this get printed over and over again?

def func1(num):
    print(num ** 2)
    time.sleep(1)

def func2(num):
    print(num ** 3)
    time.sleep(1)

if __name__ == '__main__':
    counter = 0
    while counter < 10:
        proc1 = Process(target=func1, args=[2])
        proc2 = Process(target=func2, args=[2])
        proc1.start()
        proc2.start()
        proc1.join()
        proc2.join()
        counter += 1
Once I run it, it prints "hello" every loop. I'm sure I'm just making a dumb mistake, but any help would be great. Thanks.
multiprocessing can fork an existing process or spawn a new process, depending on which options your operating system supports. On Windows, which can only spawn (execute a new process), a new instance of python is executed. That instance imports the module and then recreates your execution environment by expanding a pickled snapshot of your parent process. Theoretically, just enough to get the environment right for the subprocess.
In your case, print is at the module level, so it is executed as part of the import in the subprocess. If this is the "__main__" module, you can simply put that print in the if __name__ == "__main__": clause. When the module is imported by a worker instead of executed as a script, that print won't run.
If it's not the main script module, well, that's messier. The general rule for modules is that they should be importable without side effects, and that print is a side effect. Best to remove it in that case.
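As a minimal sketch of that fix (assuming the print only needs to happen in the parent), move it under the guard so the re-import in each worker stays silent:
from multiprocessing import Process
import time

def func1(num):
    print(num ** 2)
    time.sleep(1)

if __name__ == '__main__':
    print('hello')  # now runs only in the parent process, not on each re-import
    proc1 = Process(target=func1, args=[2])
    proc1.start()
    proc1.join()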

Python Multiprocessing within Jupyter Notebook

I am new to the multiprocessing module in Python and work with Jupyter notebooks. I have tried the following code snippet from PMOTW:
import multiprocessing

def worker():
    """worker function"""
    print('Worker')
    return

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
When I run this as is, there is no output.
I have also tried creating a module called worker.py and then importing that to run the code:
import multiprocessing
from worker import worker

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
There is still no output in that case. In the console, I see the following error (repeated multiple times):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Program Files\Anaconda3\lib\multiprocessing\spawn.py", line 106, in spawn_main
    exitcode = _main(fd)
  File "C:\Program Files\Anaconda3\lib\multiprocessing\spawn.py", line 116, in _main
    self = pickle.load(from_parent)
AttributeError: Can't get attribute 'worker' on <module '__main__' (built-in)>
However, I get the expected output when the code is saved as a Python script and executed.
What can I do to run this code directly from the notebook without creating a separate script?
I'm relatively new to parallel computing so I may be wrong with some technicalities. My understanding is this:
Jupyter notebooks don't work with multiprocessing because the module pickles (serialises) data to send to processes.
multiprocess is a fork of multiprocessing that uses dill instead of pickle to serialise data which allows it to work from within Jupyter notebooks. The API is identical so the only thing you need to do is to change
import multiprocessing
to...
import multiprocess
You can install multiprocess very easily with a simple
pip install multiprocess
You will however find that your processes will still not print to the output (although in JupyterLab they will print to the terminal the server is running in). I stumbled upon this post trying to work around this and will edit this post when I find out how.
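For what it's worth, here is a minimal sketch of that swap (assuming the multiprocess package is installed); returning values through Pool.map also sidesteps the missing print output, since the results come back to the notebook rather than being printed inside the workers:
import multiprocess as mp

def square(x):
    # dill can serialise functions defined in a notebook cell,
    # which is what lets this run from Jupyter
    return x * x

pool = mp.Pool(2)
print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
pool.close()
pool.join()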
I'm not an expert in either multiprocessing or ipykernel (which is used by Jupyter notebook), but since nobody seems to have given an answer, I will tell you what I guessed. I hope somebody complements this later on.
I guess your Jupyter notebook server is running on a Windows host. In multiprocessing there are three different start methods. Let's focus on spawn, which is the default on Windows, and fork, the default on Unix.
Here is a quick overview.
spawn
(CPython) interactive shell - always raises an error
run as a script - okay only if the multiprocessing code is nested inside if __name__ == '__main__':
fork
always okay
For example,
import multiprocessing

def worker():
    """worker function"""
    print('Worker')
    return

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
This code works when it's saved and run as a script, but raises an error when entered in a Python interactive shell. Looking at the implementation of the ipython kernel, my guess is that it uses some kind of interactive shell and so doesn't go well with spawn (but please don't trust me).
As a side note, I will give you a general idea of how spawn and fork differ. In multiprocessing, each subprocess runs its own Python interpreter. In particular, with spawn, a child process starts a new interpreter and imports the necessary modules from scratch. It's hard to import code that was defined in an interactive shell, so it may raise an error.
fork is different. With fork, a child process copies the main process, including most of the running state of the Python interpreter, and then continues execution. This code will help you understand the concept.
import os

main_pid = os.getpid()
os.fork()
print("Hello world(%d)" % os.getpid())  # printed twice: Hello world(id1), Hello world(id2)
if os.getpid() == main_pid:
    print("Hello world(main process)")  # printed once: Hello world(main process)
Much like you, I encountered the attribute error. The problem seems to be related to how Jupyter handles multithreading. The fastest result I got was to follow the Multi-processing example.
So the ThreadPool took care of my issue.
from multiprocessing.pool import ThreadPool as Pool

def worker(num):
    """worker function; num is the item passed in by pool.map"""
    print('Worker\n')
    return

pool = Pool(4)
for result in pool.map(worker, range(5)):
    pass  # or print diagnostics
This works for me on a Mac (I cannot make it work on Windows):
import multiprocessing as mp

mp_start_count = 0

if __name__ == '__main__':
    if mp_start_count == 0:
        mp.set_start_method('fork')
        mp_start_count += 1
Save the function to a separate Python file then import the function back in. It should work fine that way.

Is there any interpreter that works well for running multiprocessing in python 2.7 version on windows 7?

I was trying to run a piece of code. This code is all about multiprocessing. It works fine on the command prompt and generates some output. But when I try to run this code in PyScripter, it just says that the script ran OK; it doesn't generate any output, nor does it display any error message. It doesn't even crash. It would be really helpful if anyone could help me find an interpreter where this multiprocessing works fine.
Here is the piece of code:
from multiprocessing import Process

def wait():
    print "wait"
    clean()

def clean():
    print "clean"

def main():
    p = Process(target=wait)
    p.start()
    p.join()

if _name_=='_main_':
    main()
The normal interpreter works just fine with multiprocessing on Windows 7 for me. (Your IDE might not like multiprocessing.)
You just have to do
if __name__=='__main__':
    main()
with 2 underscores (__) each instead of 1 (_).
Also - if you don't have an actual reason not to use it, multiprocessing.Pool is much easier to use than multiprocessing.Process in most cases. Have a look at https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool
An implementation with a Pool would be
import multiprocessing

def wait():
    print "wait"
    clean()

def clean():
    print "clean"

def main():
    p = multiprocessing.Pool()
    p.apply_async(wait)
    p.close()
    p.join()

if __name__=='__main__':
    main()
but which method of Pool to use strongly depends on what you actually want to do.
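For instance, if the worker takes an argument and returns a value, Pool.map collects all the results for you (square here is just an illustrative function, not from the question):
import multiprocessing

def square(x):
    return x * x

def main():
    p = multiprocessing.Pool()
    results = p.map(square, [1, 2, 3, 4])  # blocks until all workers are done
    p.close()
    p.join()
    print(results)  # [1, 4, 9, 16]

if __name__ == '__main__':
    main()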

os.system() in Python gives infinite loop

My main Python script imports 2 other scripts: Test1.py and Test2.py.
Test1.py does multiprocessing, and Test2.py does a simple os.system('ls') command. When Test1.py is finished and Test2.py is called, os.system('ls') goes crazy and creates infinitely many new processes. Does anyone know why this happens?
# Main
import multiprocessing
import Test1
import Test2

def doSomething():
    # Function 1, file1...file10 contain [name, path]
    data = [file1, file2, file3, file4, file5, file6, file7, file8, file9, file10]
    if __name__ == '__main__':
        pool = multiprocessing.Pool(processes=min(len(data), 5))
        print pool.map(Test1.worker, data)
    # Function 2
    Test2.worker()
Test1.py, which calls perl commands:
import subprocess

def worker(data):
    command = 'perl '+data[1].split('data_')[0]+'methods_FastQC\\fastqc '+data[1]+'\\'+data[0]+'\\'+data[0]+' --outdir='+data[1]+'\\_IlluminaResults\\_fastqcAnalysis'
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    process.wait()
    process.stdout.read()
Test2.py should run ONE simple ls command; instead it never stops spawning new ones:
import os

def worker():
    command = 'ls'
    os.system(command)
When looking at the processes after the script is started, it seems like the processes from function 1 also don't close properly. Via the Task Manager I still see 5 extra pythonw.exe processes which don't seem to do anything; they only go away when I close the shell that was opened. That's probably related to why os.system(command) goes crazy in function 2? Does anyone have a solution? I can't close the shell, because the script is not finished, since it still has to do function 2.
Edit: While trying to find a solution, it also happened that function 1 started by executing the commands from function 2 multiple times, and only after that the perl commands, which is even weirder.
It seems doSomething() is executed every time your main module is imported, and it can be imported several times by multiprocessing during worker initialization. You could check this by printing the process pid with print(os.getpid()) in Test2.worker().
You should use if __name__ == '__main__': at the module level. It is error-prone to do it inside a function as your code shows.
import multiprocessing

# ...

if __name__ == '__main__':  # at global level
    multiprocessing.freeze_support()
    main()  # it calls do_something() and everything else
See the very first note in the introduction to multiprocessing.
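A rough sketch of that restructuring using the names from the question (file1...file10 remain the undefined placeholders from the original post, and Test1/Test2 are the question's own modules):
# Main
import multiprocessing
import Test1
import Test2

def do_something():
    # Function 1, file1...file10 contain [name, path]
    data = [file1, file2, file3, file4, file5, file6, file7, file8, file9, file10]
    pool = multiprocessing.Pool(processes=min(len(data), 5))
    print pool.map(Test1.worker, data)
    pool.close()
    pool.join()
    # Function 2
    Test2.worker()

if __name__ == '__main__':  # guard at module level, not inside a function
    multiprocessing.freeze_support()
    do_something()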
