When using the multiprocessing library in Python, we know that if __name__ == '__main__' is necessary to start child processes correctly from the parent. Otherwise there can be a RuntimeError like this:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
But is it possible to avoid if __name__ == '__main__'? What if I am writing a library that spawns multiple processes? Is it possible to spare users from having to write if __name__ == '__main__' themselves and from dealing with the tedious details of multiprocessing?
I can write if __name__ == '__main__': inside my library, but then the user's module-level code would still run once in each of my child processes, as the sketch below illustrates.
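For illustration, a rough sketch of what I mean (mylib.py, run_workers, and user_script.py are made-up names):

# mylib.py (hypothetical library)
import multiprocessing as mp

def _worker():
    print('worker running')

def run_workers():
    # Putting the guard here does not help: this module is never the
    # main script, so it cannot stop the *user's* module-level code
    # from re-running in each spawned child.
    procs = [mp.Process(target=_worker) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

# user_script.py (the library user's code)
import mylib

print('expensive user setup')   # with the spawn start method this runs again in every child
mylib.run_workers()             # and this call re-runs there too -> the RuntimeError above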
Related
I'm trying to get started with multiprocessing, and I'm running into some interesting issues. The code I'm using is below (for the record, this example is straight from the multiprocessing documentation):
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob'))
    p.start()
    p.join()
This works fine, and prints "hello bob" as it should. When I add any additional code to the file though, before or after the if statement, then p does not evaluate, and my file loops back to the beginning and runs all over again endlessly. For example, the following code gives me this issue:
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob'))
    p.start()
    p.join()

test_input = input("test input")
I am running Python using Windows 10, Pycharm v. 2021.3.2, and Python 3.10.0. Is this an issue that any of you have seen before? At this point I'm starting to wonder if perhaps it's even an issue between Windows and Pycharm or Windows and Python, or maybe just a case of inexperience on my part.
Thank you!
That if __name__ == '__main__': guard is important. On systems that don't use fork, it simulates a fork by importing the main script in each worker process without naming it __main__ (it's named __mp_main__ IIRC). Any code that should only run in the "main" script needs to be protected by that guard (it can be indirectly, by defining a function and calling it within the guarded segment; the function will be defined in the workers, but not run).
So to fix this, all you need to do is indent the test_input = input("test input") so it's protected by the if __name__ == '__main__': guard. In real code, I try to keep the guarded section clean (so I can't accidentally write functions that rely on global state that doesn't exist when it's not run as the main script, and for the mild performance benefits of using function locals over globals), so I'd write it like:
from multiprocessing import Process

def f(name):
    print('hello', name)

def main():
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
    test_input = input("test input")

if __name__ == '__main__':
    main()
but that's not strictly necessary.
I thought I would elaborate on ShadowRanger's answer:
On Windows systems new subprocesses are created by the following steps:
1. A new process is created in which the Python interpreter is re-launched.
2. The Python interpreter re-interprets the current source program, executing everything that is at global scope in order to compile function definitions, initialize global variables, etc.
3. Finally, your worker function, f in this case, is invoked with memory thus initialized.
The reason for placing the code that creates the subprocess within a block governed by if __name__ == '__main__': is that if you didn't, then because of Step 2 above you would get into a recursive, infinite loop creating new subprocesses ad infinitum. The key point is that only in the main process will the variable __name__ have the value '__main__'; it will have a different value in any subprocess that is created. And so the code that creates the new subprocess, i.e. p = Process(target=f, args=('bob',)), will not be executed as part of the initialization of the subprocess.
Your problem arises from the statement test_input = input("test input") being at global scope and not within an if __name__ == '__main__': block, so it is executed as part of the initialization of the subprocess. Your worker function, f, will therefore not run until that prompt for input is satisfied, and when it returns, your main process will put out the prompt again. Anyway, this is what I see when the program is run from a Windows command prompt. Perhaps with PyCharm there is a restriction against running the input statement from any thread other than the main thread. But even if an exception were being thrown from that statement while creating the subprocess, I still don't quite see how your program would loop continuously. Unfortunately, I do not have PyCharm installed.
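Here is a small sketch that makes Step 2 visible (the messages are just illustrative). With the spawn start method, the module-level print runs once in the parent under '__main__' and again in each child under a different name, typically '__mp_main__':

import multiprocessing as mp

# Runs in the parent as '__main__' and again in each spawned child,
# where __name__ is typically '__mp_main__'.
print('module-level code, __name__ =', __name__)

def show_name():
    print('inside the worker, __name__ =', __name__)

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    p = mp.Process(target=show_name)
    p.start()
    p.join()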
Regarding ShadowRanger's answer, I think you should also put a comma after 'bob'.
According to https://docs.python.org/3/library/multiprocessing.html, args should be a tuple, so the creation of p should look like this:
p = Process(target=f, args=('bob',))
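A quick way to see why the trailing comma matters (parentheses alone don't make a tuple):

print(type(('bob')))   # <class 'str'>  -> Process would unpack it character by character
print(type(('bob',)))  # <class 'tuple'> -> f is called once, with 'bob'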
I'm using multiprocessing in a larger code base where some of the import statements have side effects. How can I run a function in a background process without having it inherit global imports?
# helper.py:
print('This message should only print once!')

# main.py:
import multiprocessing as mp
import helper  # This prints the message.

def worker():
    pass  # Unfortunately this also prints the message again.

if __name__ == '__main__':
    mp.set_start_method('spawn')
    process = mp.Process(target=worker)
    process.start()
    process.join()
Background: Importing TensorFlow initializes CUDA, which reserves some amount of GPU memory. As a result, spawning too many processes leads to a CUDA OOM error, even though the processes don't use TensorFlow.
Similar question without an answer:
How to avoid double imports with the Python multiprocessing module?
Is there a resource that explains exactly what the multiprocessing module does when starting an mp.Process?
Super quick version (using the spawn context, not fork):
Some stuff (a pair of pipes for communication, cleanup callbacks, etc.) is prepared, then a new process is created with fork() followed by exec(). On Windows it's CreateProcessW(). The new Python interpreter is started with a startup script, spawn_main(), and is passed the communication pipe file descriptors via a crafted command string and the -c switch. The startup script cleans up the environment a little bit, then unpickles the Process object from its communication pipe. Finally it calls the run method of the process object.
So what about importing of modules?
Pickle semantics handle some of it, but __main__ and sys.modules need some TLC, which is handled during the "cleans up the environment" bit.
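As a small sketch of that last step (MyProcess and the printed message are made up): the child unpickles the Process object and calls its run() method, which you can observe by overriding it:

import multiprocessing as mp
import os

class MyProcess(mp.Process):
    def run(self):
        # The spawned child unpickles this object and calls run();
        # the default run() would call self._target(*self._args).
        print('run() called in pid', os.getpid())

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    p = MyProcess()
    p.start()
    p.join()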
# helper.py:
print('This message should only print once!')

# main.py:
import multiprocessing as mp

def worker():
    pass

def main():
    # Importing the module only locally so that the background
    # worker won't import it again.
    import helper

    mp.set_start_method('spawn')
    process = mp.Process(target=worker)
    process.start()
    process.join()

if __name__ == '__main__':
    main()
I was having some trouble figuring out why my console would always print the print statements I had at the start of my file. Here's what it looks like:
from multiprocessing import Process
import time

print('hello')  # why does this get printed over and over again?

def func1(num):
    print(num ** 2)
    time.sleep(1)

def func2(num):
    print(num ** 3)
    time.sleep(1)

if __name__ == '__main__':
    counter = 0
    while counter < 10:
        proc1 = Process(target=func1, args=[2])
        proc2 = Process(target=func2, args=[2])
        proc1.start()
        proc2.start()
        proc1.join()
        proc2.join()
        counter += 1
Once I run it, it prints "hello" on every loop. I'm sure I'm just making a dumb mistake, but any help would be great. Thanks.
multiprocessing can fork an existing process or spawn a new process, depending on which options your operating system supports. On Windows, which can only spawn (execute a new process), a new instance of Python is executed. That instance imports the module and then recreates your execution environment by expanding a pickled snapshot of your parent process; theoretically, just enough to get the environment right for the subprocess.
In your case, print is at the module level, so it is executed as part of the import in the subprocess. Since this is the "__main__" module, you can simply put that print inside the if __name__ == "__main__": clause. When the file is imported as a module instead of executed as a script, that print won't run.
If it's not the main script module, well, that's messy. The general rule for modules is that they should be importable without side effects, and that print is a side effect. Best to remove it in that case.
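A minimal sketch of that fix, trimmed to one worker for brevity: the module-level print moves behind the guard, so the spawned children stay quiet:

from multiprocessing import Process
import time

def func1(num):
    print(num ** 2)
    time.sleep(1)

if __name__ == '__main__':
    print('hello')  # now runs only in the parent process
    proc1 = Process(target=func1, args=[2])
    proc1.start()
    proc1.join()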
I'm working on a script using the multiprocessing library, and everything works perfectly from my text editor (VS Code):
import multiprocessing

def example_func():
    print("This is a targeted function for multiprocessing")

if __name__ == "__main__":
    print("This is the main session, starting multiprocessing")
    multiprocessing.Process(target=example_func).start()
So when I run the code in my text editor, it outputs this:
This is the main session, starting multiprocessing
This is a targeted function for multiprocessing
But after I compile it to an .exe using PyInstaller, something very strange happens: the code loops infinitely. It is as if, after compiling, the child processes were considered the main session, meaning that inside them if __name__ == "__main__" evaluates as true.
Please help, I really need your help.
EDIT: Some people told me to compare against a string; I already have it as a string in my script, I just didn't copy it correctly here.
You need to use multiprocessing.freeze_support appropriately when you're freezing to a Windows executable:
if __name__ == "__main__":
    multiprocessing.freeze_support()  # Required for PyInstaller
    print("This is the main session, starting multiprocessing")
    multiprocessing.Process(target=example_func).start()
Without it, the Windows "fork-like" behavior multiprocessing relies on doesn't know where to stop executing code when it launches subprocesses with the same executable.
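For reference, a complete minimal script for the frozen case might look like this; freeze_support() is documented to do nothing when the program isn't frozen, so it is safe to leave in during normal runs:

import multiprocessing

def example_func():
    print("This is a targeted function for multiprocessing")

if __name__ == "__main__":
    multiprocessing.freeze_support()  # call this first under the guard when freezing to an executable
    print("This is the main session, starting multiprocessing")
    p = multiprocessing.Process(target=example_func)
    p.start()
    p.join()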
It is a string:
if __name__ == "__main__":
    pass
Note the double quotes: you compare against the string "__main__", not against __main__ as a bare name.
I don't really know why it didn't throw an error there, but __main__ should be the string '__main__'.
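For what it's worth, forgetting the quotes would normally fail immediately, since __main__ is not a defined name:

if __name__ == __main__:   # NameError: name '__main__' is not defined
    pass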
You have to compare __name__ to the string '__main__':
import multiprocessing

def example_func():
    print("This is a targeted function for multiprocessing")

if __name__ == "__main__":
    print("This is the main session, starting multiprocessing")
    multiprocessing.Process(target=example_func).start()
I have this code:
import os

pid = os.fork()
if pid == 0:
    os.environ['HOME'] = "rep1"
    external_function()
else:
    os.environ['HOME'] = "rep2"
    external_function()
and this code:
import os
from multiprocessing import Process, Pipe

def f(conn):
    os.environ['HOME'] = "rep1"
    external_function()
    conn.send(some_data)
    conn.close()

if __name__ == '__main__':
    os.environ['HOME'] = "rep2"
    external_function()
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print(parent_conn.recv())
    p.join()
The external_function initializes an external program by creating the necessary sub-directories in the directory found in the environment variable HOME. This function does this work only once in each process.
With the first example, which uses os.fork(), the directories are created as expected. But with the second example, which uses multiprocessing, only the directories in rep2 get created.
Why isn't the second example creating directories in both rep1 and rep2?
The answer you are looking for is addressed in detail here. There is also an explanation of the differences between operating systems.
One big issue is that the fork system call does not exist on Windows. Therefore, when running on Windows you cannot use this method. multiprocessing is a higher-level interface for executing a part of the currently running program. Therefore, it creates a copy of your process's current state, just as forking does. That is to say, it takes care of the forking of your program for you.
Therefore, where available, you could consider fork() a lower-level interface to forking a program, and the multiprocessing library a higher-level interface to it.
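If you're unsure which of these interfaces your platform offers, multiprocessing can tell you (the values in the comments are just examples):

import multiprocessing as mp

print(mp.get_all_start_methods())  # e.g. ['fork', 'spawn', 'forkserver'] on Linux, ['spawn'] on Windows
print(mp.get_start_method())       # the default start method for this platform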
To answer your question directly, there must be some side effect of external_function that makes it so that when the calls run in series, you get different results than if you run them at the same time. This is due to how you set up your code, and the lack of differences between os.fork and multiprocessing.Process on systems where os.fork is supported.
The only real difference between os.fork and multiprocessing.Process is portability and library overhead, since os.fork is not supported on Windows, and the multiprocessing framework is included to make multiprocessing.Process work. This is because os.fork is called by multiprocessing.Process, as this answer backs up.
The important distinction, then, is that os.fork copies everything in the current process using Unix's forking, which means that at the time of forking both processes are the same apart from their PIDs. On Windows, this is emulated by rerunning all the setup code before the if __name__ == '__main__': line, which is roughly the same as creating a subprocess using the subprocess library.
In your case, the two code snippets you provide do fairly different things, because in the second snippet you call external_function in the main process before you start the new process, so the two calls run in series (though in different processes). Also, the pipe is unnecessary, as it adds nothing that the first snippet does.
In Unix, the code snippets:
import os

pid = os.fork()
if pid == 0:
    os.environ['HOME'] = "rep1"
    external_function()
else:
    os.environ['HOME'] = "rep2"
    external_function()
and:
import os
from multiprocessing import Process

def f():
    os.environ['HOME'] = "rep1"
    external_function()

if __name__ == '__main__':
    p = Process(target=f)
    p.start()
    os.environ['HOME'] = "rep2"
    external_function()
    p.join()
should do exactly the same thing, but with a little extra overhead from the included multiprocessing library.
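As a tiny POSIX-only check (with external_function replaced by a print, since we don't have its code), each process ends up with its own copy of HOME after the fork:

import os

pid = os.fork()
if pid == 0:                           # child
    os.environ['HOME'] = "rep1"
    print('child  pid', os.getpid(), 'HOME =', os.environ['HOME'])
    os._exit(0)
else:                                  # parent
    os.environ['HOME'] = "rep2"
    print('parent pid', os.getpid(), 'HOME =', os.environ['HOME'])
    os.waitpid(pid, 0)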
Without further information, we can't figure out what the issue is. If you can provide code that demonstrates the issue, that would help us help you.