I've been tasked with porting some Python code that works perfectly on Linux, to Windows. Unfortunately the code fails with the error:
An attempt has been made to start a new process before current
process has finished its bootstrapping phase.
From what I've been reading this is all due to Windows not using fork, and instead using spawn to create new threads.
This is where I'd post code, but there's another problem: there is so much of it. One file imports another, which imports another, which starts threads, and which contains another class that also spawns a new thread.
So from what I've been reading, it's just a matter of using the if __name__ == '__main__': guard to prevent an infinite loop of threads from spawning, but I don't know where to put it.
The python file that I run already has this in place, but threads are spawned from other methods in other classes, so do I also need to put it into those classes?
It also seems that whoever wrote this used a global file, which gets imported into each of the separate Python files, and then in turn also imports other Python files. Sorry for not giving any code, but I really don't even know where to start with this one.
Any input or advice would be greatly appreciated.
Just to clear up any potential confusion, it appears this code is using processes, not threads. So there is code like:
import multiprocessing as mp
....
manager = mp.Manager()
...
process = mp.Process(blah)
process.start()
process.join()
....
process.terminate()
To add some more information: after making some minor changes, I've got the program to run. It's using Flask to provide a REST API that runs functions when it receives various HTTP requests.
It's currently throwing this exception when calling
manager = mp.Manager()
This line is part of the global.py file that is used in each of the other python files.
OK, after some modifications, I've now got past that problem, and onto another one! The processes are now starting, but the variables aren't available when imported from another class, for example
global.py
if __name__ == '__main__':
    thingy = manager.dict()
handler.py
from global import *
if thingy['status'] == 'working':
NameError: name 'thingy' is not defined
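For what it's worth, that NameError follows directly from the guard: when thingy is only created inside the if __name__ == '__main__': block, it simply never exists in a process (or module) that merely imports the shared file. A common pattern, sketched below with hypothetical module and function names rather than the real code, is to create the manager objects once in the entry-point script and pass them explicitly to the code that needs them:

# main.py -- hypothetical entry-point script, for illustration only
import multiprocessing as mp
import handler

if __name__ == '__main__':
    manager = mp.Manager()
    thingy = manager.dict()      # created once, in the parent process only
    thingy['status'] = 'working'

    p = mp.Process(target=handler.work, args=(thingy,))  # hand the proxy to the child
    p.start()
    p.join()

# handler.py -- hypothetical worker module: takes the shared dict as an argument
# instead of importing it from a shared globals file
def work(thingy):
    if thingy['status'] == 'working':
        print('still working')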
While using multiprocessing in Python on Windows, it is expected that you protect the entry point of the program.
The documentation says "Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process)". Can anyone explain what exactly this means?
Expanding a bit on the good answer you already got, it helps if you understand what Linux-y systems do. They spawn new processes using fork(), which has two good consequences:
All data structures existing in the main program are visible to the child processes. They actually work on copies of the data.
The child processes start executing at the instruction immediately following the fork() in the main program - so any module-level code already executed in the module will not be executed again.
fork() isn't possible on Windows, so there each module is imported anew by each child process. So:
On Windows, no data structures existing in the main program are visible to the child processes; and,
All module-level code is executed in each child process.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program - so that should be protected by __name__ == '__main__'. For a subtler example, consider code that builds a gigantic list, which you intend to pass out to worker processes to crawl over. You probably want to protect that too, because there's no point in this case to make each worker process waste RAM and time building their own useless copies of the gigantic list.
Note that it's a Good Idea to use __name__ == "__main__" appropriately even on Linux-y systems, because it makes the intended division of work clearer. Parallel programs can be confusing - every little bit helps ;-)
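As a minimal sketch of the two cases above (all names here are made up for illustration): the process-creating code and the gigantic list are both protected, while the cheap function definition stays at module level so each worker can import it.

import multiprocessing as mp

def crawl(item):
    # cheap to define; stays at module level so every worker process can import it
    return item * item

if __name__ == '__main__':
    # built only in the main process -- workers receive items through the pool,
    # not by re-running this line
    gigantic_list = list(range(100_000))

    with mp.Pool(4) as pool:     # process creation is guarded too
        results = pool.map(crawl, gigantic_list, chunksize=10_000)
    print(sum(results))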
The multiprocessing module works by creating new Python processes that will import your module. If you did not add __name__== '__main__' protection then you would enter a never ending loop of new process creation. It goes like this:
Your module is imported and executes code during the import that causes multiprocessing to spawn 4 new processes.
Those 4 new processes in turn import the module and execute code during the import that causes multiprocessing to spawn 16 new processes.
Those 16 new processes in turn import the module and execute code during the import that causes multiprocessing to spawn 64 new processes.
Well, hopefully you get the picture.
So the idea is that you make sure that the process spawning only happens once. And that is achieved most easily with the idiom of the __name__== '__main__' protection.
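In code, the fix described here looks something like this (a sketch, not anyone's real module):

import multiprocessing as mp

def work(n):
    return n * n

# If the two lines below were unguarded at module level, every child process
# created under spawn would re-import this module, reach them again, and try to
# start four more processes -- the 4 -> 16 -> 64 explosion described above.
if __name__ == '__main__':
    with mp.Pool(4) as pool:
        print(pool.map(work, range(8)))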
For the following script (Python 3.6, Windows, Anaconda), I noticed that the libraries are imported as many times as the number of processes invoked, and print('Hello') is also executed the same number of times.
I thought the worker processes would only be invoked for the func1 call rather than for the whole program. The actual func1 is a heavy CPU-bound task which will be executed millions of times.
Is this the right choice of framework for such a task?
import datetime  # needed for the datetime.datetime.now() calls below
import pandas as pd
import numpy as np
from concurrent.futures import ProcessPoolExecutor

print("Hello")

def func1(x):
    return x

if __name__ == '__main__':
    print(datetime.datetime.now())
    print('test start')
    with ProcessPoolExecutor() as executor:
        results = executor.map(func1, np.arange(1, 1000))
        for r in results:
            print(r)
    print('test end')
    print(datetime.datetime.now())
concurrent.futures.ProcessPoolExecutor uses the multiprocessing module to do its multiprocessing.
And, as explained in the Programming guidelines, this means you have to protect any top-level code you don't want to run in every process in your __main__ block:
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
... one should protect the “entry point” of the program by using if __name__ == '__main__':…
Notice that this is only necessary if using the spawn or forkserver start methods. But if you're on Windows, spawn is the default. And, at any rate, it never hurts to do this, and usually makes the code clearer, so it's worth doing anyway.
You probably don't want to protect your imports this way. After all, the cost of calling import pandas as pd once per core may seem nontrivial, but that only happens at startup, and the cost of running a heavy CPU-bound function millions of times will completely swamp it. (If not, you probably didn't want to use multiprocessing in the first place…) And usually, the same goes for your def and class statements (especially if they're not capturing any closure variables or anything). It's only setup code that's incorrect to run multiple times (like that print('hello') in your example) that needs to be protected.
The examples in the concurrent.futures doc (and in PEP 3148) all handle this by using the "main function" idiom:
def main():
    # all of your top-level code goes here
    ...

if __name__ == '__main__':
    main()
This has the added benefit of turning your top-level globals into locals, to make sure you don't accidentally share them (which can especially be a problem with multiprocessing, where they get actually shared with fork, but copied with spawn, so the same code may work when testing on one platform, but then fail when deployed on the other).
If you want to know why this happens:
With the fork start method, multiprocessing creates each new child process by cloning the parent Python interpreter and then just starting the pool-servicing function up right where you (or concurrent.futures) created the pool. So, top-level code doesn't get re-run.
With the spawn start method, multiprocessing creates each new child process by starting a clean new Python interpreter, importing your code, and then starting the pool-servicing function. So, top-level code gets re-run as part of the import.
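If you want to see that difference for yourself, multiprocessing lets you pick the start method explicitly; a small sketch (run it on Linux or macOS to compare, since fork is unavailable on Windows):

import multiprocessing as mp

print('top-level code running')   # with 'spawn' this prints in every worker; with 'fork' only once

def square(x):
    return x * x

if __name__ == '__main__':
    mp.set_start_method('spawn')  # change to 'fork' on a Linux-y system to compare
    with mp.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))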
The joblib docs contain the following warning:
Under Windows, it is important to protect the main loop of code to
avoid recursive spawning of subprocesses when using joblib.Parallel.
In other words, you should be writing code like this:
import ....

def function1(...):
    ...

def function2(...):
    ...

if __name__ == '__main__':
    # do stuff with imports and functions defined above
    ...
No code should run outside of the “if __name__ == ‘__main__’” blocks,
only imports and definitions.
Initially, I assumed this was just to guard against the occasional odd case where a function passed to joblib.Parallel called the module recursively, which would mean it was generally good practice but often unnecessary. However, it doesn't make sense to me why this would only be a risk on Windows. Additionally, this answer seems to indicate that failure to protect the main loop resulted in the code running several times slower than it otherwise would have for a very simple non-recursive problem.
Out of curiosity, I ran the super-simple example of an embarrassingly parallel loop from the joblib docs without protecting the main loop on a windows box. My terminal was spammed with the following error until I closed it:
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information
My question is: what about the Windows implementation of joblib requires the main loop to be protected in every case?
Apologies if this is a super basic question. I am new to the world of parallelization, so I might just be missing some basic concepts, but I couldn't find this issue discussed explicitly anywhere.
Finally, I want to note that this is purely academic; I understand why it is generally good practice to write one's code in this way, and will continue to do so regardless of joblib.
This is necessary because Windows doesn't have fork(). Because of this limitation, Windows needs to re-import your __main__ module in all the child processes it spawns, in order to re-create the parent's state in the child. This means that if you have the code that spawns the new process at the module-level, it's going to be recursively executed in all the child processes. The if __name__ == "__main__" guard is used to prevent code at the module scope from being re-executed in the child processes.
This isn't necessary on Linux because it does have fork(), which allows it to fork a child process that maintains the same state of the parent, without re-importing the __main__ module.
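For reference, the guarded joblib pattern looks like this; it follows the embarrassingly parallel example from the joblib docs, with sqrt standing in for real work:

from math import sqrt
from joblib import Parallel, delayed

if __name__ == '__main__':
    # Guarded so that the worker processes joblib starts on Windows can
    # re-import this module without kicking off another Parallel run.
    results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
    print(results)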
In case someone stumbles across this in 2021:
Due to the new backend "loky" used by joblib>0.12, protecting the main for loop is no longer required. See https://joblib.readthedocs.io/en/latest/parallel.html
I've been looking at examples from other people but I can't seem to get it to work properly.
It'll either use a single core, or basically freeze up maya if given too much to process, but I never seem to get more than one core working at once.
So for example, this is kind of what I'd like it to do, on a very basic level. Mainly, just let each loop iteration run simultaneously on a different processor with the different values (in this case, the two values would use two processors):
mylist = [50, 100, 23]
newvalue = [50, 51]

for j in range(0, len(newvalue)):
    exists = False
    for i in range(0, len(mylist)):
        # search list
        if newvalue[j] == mylist[i]:
            exists = True
    # add to list
    if exists == True:
        mylist.append(mylist)
Would it be possible to pull this off? The actual code I'm wanting to use it on can take from a few seconds to about 10 minutes for each loop, but they could theoretically all run at once, so I thought multithreading would speed it up loads.
Bear in mind I'm still relatively new to python so an example would be really appreciated
Cheers :)
There are really two different answers to this.
Maya scripts are really supposed to run in the main UI thread, and there are lots of ways they can trip you up if run from a separate thread. Maya includes a module called maya.utils which includes methods for deferred evaluation in the main thread. Here's a simple example:
import maya.cmds as cmds
import maya.utils as utils
import threading
def do_in_main():
    utils.executeDeferred(cmds.sphere)

for i in range(10):
    t = threading.Thread(target=do_in_main, args=())
    t.start()
That will allow you to do things with the maya ui from a separate thread (there's another method in utils that will allow the calling thread to await a response too). Here's a link to the maya documentation on this module
However, this doesn't get you around the second aspect of the question. Maya python isn't going to split up the job among processors for you: threading will let you create separate threads, but they all share the same Python interpreter, and the global interpreter lock will mean that they end up waiting for it rather than running along independently.
You can't use the multiprocessing module, at least not AFAIK, since it spawns new mayas rather than pushing script execution out into other processors in the Maya you are running within. Python aside, Maya is an old program and not very multi-core oriented in any case. Try XSI :)
Any threading stuff in Maya is tricky in any case - if you touch the main application (basically, any function from the API or a maya.whatever module) without the deferred execution above, you'll probably crash maya. Only use it if you have to.
And, BTW, you can't use executeDeferred, etc. in batch mode since they are implemented using the main UI loop.
What theodox says is still true today, six years later. However, one may go another route by spawning a new process using the subprocess module. You'll have to communicate and share data via sockets or something similar, since the new process is in a separate interpreter. The new interpreter runs on its own and doesn't know about Maya, but you can do any other work in it, benefiting from the multi-core environment your OS provides, before communicating the result back to your Maya Python script.
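A rough sketch of that route, with stdin/stdout used in place of sockets and a hypothetical worker.py script (inside Maya you would point the command at a standalone Python or at Maya's own mayapy interpreter, since sys.executable there is Maya itself):

# launcher snippet, run from your Maya Python script
import subprocess
import json

payload = json.dumps({'values': [50, 51]})
proc = subprocess.Popen(
    ['python', 'worker.py'],                # or the full path to mayapy / a standalone python
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
out, _ = proc.communicate(payload)          # send the work, block until the result comes back
result = json.loads(out)

# worker.py -- runs in its own interpreter and knows nothing about Maya
import sys
import json

data = json.load(sys.stdin)
json.dump({'processed': [v * 2 for v in data['values']]}, sys.stdout)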
I'm working on a larger size Python application which runs a solver on several hundred distinct problem scenarios. There is a GUI that allows the user to set up the solver configuration. In an attempt to speed this up I have implemented a multiprocessing Pool to spawn new instances of the solver module within the application.
What ends up happening is that during the pool's creation four new copies of the GUI appear, which is entirely not what I'm looking to have happen. I've taken what I thought were the appropriate steps in protecting the entry point of the application as per the programming guidelines but perhaps I've misunderstood something fundamental about the multiprocessing module.
I've followed the guideline in this thread in creating a minimal startup module.
ScenarioSolver.solveOneScenario creates a new instance of the solver and scenarios_to_solve is a list of arguments.
process_pool = multiprocessing.Pool(4)
for _, result in enumerate(process_pool.imap_unordered(ScenarioSolver.solveOneScenario, scenarios_to_solve)):
    self.processResult(result)
So, based on the limited information here, what might I have overlooked in using the Pool?
EDIT: This behaviour only happens when I package the application into an executable using py2exe. When running from Eclipse I get the intended behaviour.
This is the same problem as solved in this thread.
Adding multiprocessing.freeze_support() immediately after if __name__ == '__main__' solved this problem.
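In other words, the frozen-executable-safe entry point ends up looking like this (main() here is just a stand-in for whatever builds the GUI and creates the pool):

import multiprocessing

def main():
    # stand-in for the code that builds the GUI and creates the Pool
    with multiprocessing.Pool(4) as pool:
        print(pool.map(abs, [-1, -2, -3]))

if __name__ == '__main__':
    multiprocessing.freeze_support()   # comes straight after the guard, for frozen (py2exe) builds
    main()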