I'm trying to parallelize a for loop to speed up my code, since the operations in the loop are all independent. Following online tutorials, it seems the standard multiprocessing library in Python is a good start, and I've got this working for basic examples.
However, for my actual use case, I find that parallel processing (using a dual core machine) is actually a little (<5%) slower, when run on Windows. Running the same code on Linux, however, results in a parallel processing speed-up of ~25%, compared to serial execution.
From the docs, I believe this may relate to Windows' lack of a fork() function, which means the process needs to be initialised afresh each time. However, I don't fully understand this and wonder if anyone can confirm it, please?
Particularly,
--> Does this mean that all code in the calling Python file gets run for each parallel process on Windows, even initialising classes and importing packages?
--> If so, can this be avoided by somehow passing a copy (e.g. using deepcopy) of the class into the new processes?
--> Are there any tips / other strategies for efficient parallelisation of code when designing for both Unix and Windows?
My exact code is long and uses many files, so I have created a pseudocode-style example structure which hopefully shows the issue.
# Imports
import multiprocessing
import numpy as np
from my_package import MyClass
# ... many other packages / functions imported here

# Initialization (instantiate class and call slow functions that get it ready for processing)
my_class = MyClass()
my_class.set_up(input1=1, input2=2)

# Define main processing function to be used in loop
def calculation(_input_data):
    # Perform some functions on _input_data
    # ......
    # Call method of the instantiated class to act on the data
    return my_class.class_func(_input_data)

input_data = np.linspace(0, 1, 50)
output_data = np.zeros_like(input_data)

# For loop (SERIAL implementation)
for i, x in enumerate(input_data):
    output_data[i] = calculation(x)

# PARALLEL implementation (this doesn't work well!)
with multiprocessing.Pool(processes=4) as pool:
    results = pool.map_async(calculation, input_data)
    results.wait()
    output_data = results.get()
EDIT: I do not believe the question is a duplicate of the one suggested, since this relates to a difference between Windows and Linux, which is not mentioned at all in the suggested duplicate question.
NT operating systems lack the UNIX fork primitive. When a new process is created, it starts as a blank process, and it is the responsibility of the parent to instruct the new process on how to bootstrap.
The Python multiprocessing API abstracts process creation, trying to give the same feel to the fork, forkserver and spawn start methods.
When you use the spawn start method, this is what happens under the hood:
1. A blank process is created.
2. The blank process starts a brand new Python interpreter.
3. The Python interpreter is given the MFA (Module Function Arguments) you specified via the Process class initializer.
4. The Python interpreter loads the given module, resolving all the imports.
5. The target function is looked up within the module and called with the given args and kwargs.
The above flow brings a few implications.
As you noticed yourself, it is a much more taxing operation compared to fork. That's why you notice such a difference in performance.
As the module gets imported from scratch in the child process, all import side effects are executed anew. This means that constants, global variables, decorators and top-level statements will be executed again.
On the other hand, initializations made during the parent process's execution will not be propagated to the child; see the example below.
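A minimal sketch illustrating the point (a hypothetical stand-alone script, forced to use the spawn start method):

import multiprocessing

# Module-level state; a spawned child re-imports this file and re-creates it from scratch
counter = 0

def show_counter():
    # Under spawn, the child sees the freshly initialised value, not the parent's
    print("child sees counter =", counter)

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    counter = 42  # set only in the parent, inside the __main__ guard
    p = multiprocessing.Process(target=show_counter)
    p.start()
    p.join()  # the child prints "child sees counter = 0"

Under fork the child would inherit the parent's memory and print 42 instead.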
This is why the multiprocessing documentation adds a specific paragraph for Windows in the Programming Guidelines. I highly recommend reading the Programming Guidelines, as they already include all the information required to write portable multiprocessing code.
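For the code in the question, one portable approach (a sketch reusing the question's names, on the assumption that the slow set-up can safely run once per worker) is to protect the entry point and move the expensive initialisation into a Pool initializer:

import multiprocessing
import numpy as np
from my_package import MyClass

my_class = None  # populated once per worker by init_worker()

def init_worker():
    # Runs once in each worker process, so the slow set-up is paid
    # per worker rather than per task (or at import time on Windows).
    global my_class
    my_class = MyClass()
    my_class.set_up(input1=1, input2=2)

def calculation(x):
    return my_class.class_func(x)

if __name__ == '__main__':
    input_data = np.linspace(0, 1, 50)
    with multiprocessing.Pool(processes=4, initializer=init_worker) as pool:
        output_data = np.array(pool.map(calculation, input_data))

Because nothing heavy runs at module level, spawning workers on Windows stays cheap, and the same file works unchanged under fork on Linux.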
Related
I have the following problem. Let's say we have this Python function:
def func():
    # run some code here which calls some native code
    ...
Inside func() I am calling some functions which in turn call some native C code.
If any crash happens, the whole Python process crashes altogether.
How is it possible to catch and recover from such errors?
One way that came to mind is to run this function in a separate process. But not just by starting another process: the function uses a lot of memory and objects, and it would be very hard to split all of that out. Is there something like fork() in C available in Python, to create a copy of the exact same process with the same memory structures and so on?
Or maybe other ideas?
Update:
It seems that there is no real way of catching C runtime errors in Python; they occur at a lower level and crash the whole Python virtual machine.
As solutions, you currently have two options:
1. Use os.fork(), but this works only in Unix-like OS environments.
2. Use multiprocessing and a shared-memory model to share big objects between processes. Usual serialization will just not work with objects that take multiple gigabytes of memory (you will simply run out of memory). However, there is a very good Python library called Ray (https://docs.ray.io/en/master/) that performs in-memory serialization of big objects using a shared-memory model; it is ideal for BigData/ML workloads and highly recommended.
As long as you are running on an operating system that supports fork, that's already how the multiprocessing module creates subprocesses. You could use os.fork(), multiprocessing.Process or multiprocessing.Pool to get what you want.
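As a rough sketch of the isolation idea (reusing func from the question, and assuming its arguments and its return value are small and picklable):

import multiprocessing

def _isolated_call(queue, args):
    # Runs in the child process; if the native code crashes,
    # only this process dies, not the parent.
    queue.put(func(*args))

def call_func_isolated(*args):
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=_isolated_call, args=(queue, args))
    p.start()
    p.join()
    if p.exitcode != 0:
        # A negative exitcode usually means the child was killed by a
        # signal, e.g. a segfault inside the native code.
        raise RuntimeError("func crashed in child process (exitcode %s)" % p.exitcode)
    return queue.get()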
On Windows, Python (2)'s standard library routine subprocess.Popen allows you to specify arbitrary flags to CreateProcess, and you can access the process handle for the newly-created process from the object that Popen returns. However, the thread handle for the newly-created process's initial thread is closed by the library before Popen returns.
Now, I need to create a process suspended (CREATE_SUSPENDED in creation flags) so that I can manipulate it (specifically, attach it to a job object) before it has a chance to execute any code. However, that means I need the thread handle in order to release the process from suspension (using ResumeThread). The only way I can find, to recover the thread handle, is to use the "tool help" library to walk over all threads on the entire system (e.g. see this question and answer). This works, but I do not like it. Specifically, I am concerned that taking a snapshot of all the threads on the system every time I need to create a process will be too expensive. (The larger application is a test suite, using processes for isolation; it creates and destroys processes at a rate of tens to hundreds a second.)
So, the question is: is there a more efficient way to resume execution of a process that was suspended by CREATE_SUSPENDED, if all you have is the process handle, and the facilities of the Python 2 standard library (including ctypes, but not the winapi add-on)? Vista-and-higher techniques are acceptable, but XP compatibility is preferred.
I have found a faster approach; unfortunately it relies on an undocumented API, NtResumeProcess. This does exactly what it sounds like: it takes a process handle and applies the equivalent of ResumeThread to every thread in the process. Python/ctypes code to use it looks something like this:
import ctypes
from ctypes.wintypes import HANDLE, LONG, ULONG

ntdll = ctypes.WinDLL("ntdll.dll")
RtlNtStatusToDosError = ntdll.RtlNtStatusToDosError
NtResumeProcess = ntdll.NtResumeProcess

def errcheck_ntstatus(status, *etc):
    if status < 0:
        raise ctypes.WinError(RtlNtStatusToDosError(status))
    return status

RtlNtStatusToDosError.argtypes = (LONG,)
RtlNtStatusToDosError.restype = ULONG
# RtlNtStatusToDosError cannot fail

NtResumeProcess.argtypes = (HANDLE,)
NtResumeProcess.restype = LONG
NtResumeProcess.errcheck = errcheck_ntstatus

def resume_subprocess(proc):
    # proc is a subprocess.Popen object; _handle is its Windows process handle
    NtResumeProcess(int(proc._handle))
I measured approximately 20% less process setup overhead using this technique than using Toolhelp, on an otherwise-idle Windows 7 virtual machine. As expected given how Toolhelp works, the performance delta gets bigger the more threads exist on the system -- whether or not they have anything to do with the program in question.
Given the obvious general utility of NtResumeProcess and its counterpart NtSuspendProcess, I am left wondering why they have never been documented and given kernel32 wrappers. They are used by a handful of core system DLLs and EXEs all of which, AFAICT, are part of the Windows Error Reporting mechanism (faultrep.dll, werui.dll, werfault.exe, dwwin.exe, etc) and don't appear to re-expose the functionality under documented names. It seems unlikely that these functions would change their semantics without also changing their names, but a defensively-coded program should probably be prepared for them to disappear (falling back to toolhelp, I suppose).
I'm posting this here because I found something that addresses this question. I'm looking into this myself and I believe I've found the solution with this.
I can't give you an excerpt or a summary, because it's just too much and I found it just two hours ago. I'm posting this here for all the others who, like me, seek a way to "easily" spawn a proper child process in Windows, but want to execute a cuckoo instead. ;)
The whole second chapter is of importance, but the specifics start at page 12.
http://lsd-pl.net/winasm.pdf
I hope that it helps others as much as it is hopefully going to help me.
Edit:
I guess I can add more to it. From what I've gathered, this document explains how to spawn a sleeping process which never gets executed. This way we have a properly set-up Windows process running. It then explains that by using the Win32 API functions VirtualAllocEx and WriteProcessMemory, we can easily allocate executable pages and inject machine code into the other process.
Then - the best part in my opinion - it's possible to change the registers of the process, allowing the programmer to change the instruction pointer to point at the cuckoo!
Amazing!
I have a function that uses multiprocessing (specifically joblib) to speed up a slow routine using multiple cores. It works great; no questions there.
I have a test suite that uses multiprocessing (currently just the multiprocessing.Pool() system, but can change it to joblib) to run each module's test functions independently. It works great; no questions there.
The problem is that I've now integrated the multiprocessing function into the module's test suite, so that the pool process runs the multiprocessing function. I would like to make it so that the inner function knows that it is already being multiprocessed and not spin up more forks of itself. Currently the inner process sometimes hangs, but even if it doesn't, obviously there are no gains to multiprocessing within an already parallel routine.
I can think of several ways (with lock files, setting some sort of global variable, etc.) to determine the state we're in, but I'm wondering if there is some standard way of figuring this out (either in Python's multiprocessing or in joblib). If it only works in Python 3, that'd be fine, though obviously solutions that also work on 2.7 or lower would be better. Thanks!
Parallel in joblib should be able to sort these things out:
http://pydoc.net/Python/joblib/0.8.3-r1/joblib.parallel/
Two pieces from 0.8.3-r1:
# Set an environment variable to avoid infinite loops
os.environ[JOBLIB_SPAWNED_PROCESS] = '1'
I don't know why they go from a variable holding the environment variable's name to the literal name itself, but as you can see, the feature is already implemented in joblib.
# We can now allow subprocesses again
os.environ.pop('__JOBLIB_SPAWNED_PARALLEL__', 0)
Here you can select other versions, if that's more relevant:
http://pydoc.net/Python/joblib/0.8.3-r1/
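Based on those snippets, a hedged check could be as simple as the following (the variable name is taken from the 0.8.3 source quoted above and may differ in other joblib versions):

import os

# True if joblib 0.8.x has marked this process as one of its spawned workers
already_in_joblib_worker = '__JOBLIB_SPAWNED_PARALLEL__' in os.environ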
The answer to the specific question is: I don't know of a ready-made utility.
A minimal(*) core refactoring would be to add a named parameter to the function that currently creates child processes. The default value would keep your current behavior, and another value would switch to a behavior compatible with how you are running the tests(**).
(*: there might be other, maybe better, design alternatives to consider, but we do not have enough information)
(**: one may say that the introduction of a conditional behavior would require testing that as well, and we are back to square one...)
Look at the current state of
import multiprocessing
am_already_spawned = multiprocessing.current_process().daemon
am_already_spawned will be True if the current_process is a spawned process (and thus won't benefit from more multiprocessing) and False otherwise.
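A short usage sketch of that check (process_item is a placeholder for the real per-item work):

import multiprocessing
from joblib import Parallel, delayed

def process_item(d):
    return d ** 2  # placeholder workload

def slow_routine(data, n_jobs=4):
    # Daemonic pool workers cannot spawn children of their own, so fall
    # back to serial execution when we are already inside one.
    if multiprocessing.current_process().daemon:
        n_jobs = 1
    return Parallel(n_jobs=n_jobs)(delayed(process_item)(d) for d in data)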
The joblib docs contain the following warning:
Under Windows, it is important to protect the main loop of code to avoid recursive spawning of subprocesses when using joblib.Parallel.
In other words, you should be writing code like this:
import ....

def function1(...):
    ...

def function2(...):
    ...

if __name__ == '__main__':
    # do stuff with imports and functions defined above
    ...

No code should run outside of the "if __name__ == '__main__'" blocks, only imports and definitions.
Initially, I assumed this was just to protect against the occasional odd case where a function passed to joblib.Parallel calls the module recursively, which would mean it is generally good practice but often unnecessary. However, it doesn't make sense to me why this would only be a risk on Windows. Additionally, this answer seems to indicate that failure to protect the main loop resulted in the code running several times slower than it otherwise would have, for a very simple non-recursive problem.
Out of curiosity, I ran the super-simple example of an embarrassingly parallel loop from the joblib docs without protecting the main loop on a windows box. My terminal was spammed with the following error until I closed it:
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information
My question is: what is it about the Windows implementation of joblib that requires the main loop to be protected in every case?
Apologies if this is a super basic question. I am new to the world of parallelization, so I might just be missing some basic concepts, but I couldn't find this issue discussed explicitly anywhere.
Finally, I want to note that this is purely academic; I understand why it is generally good practice to write one's code in this way, and will continue to do so regardless of joblib.
This is necessary because Windows doesn't have fork(). Because of this limitation, Windows needs to re-import your __main__ module in all the child processes it spawns, in order to re-create the parent's state in the child. This means that if you have the code that spawns the new process at the module-level, it's going to be recursively executed in all the child processes. The if __name__ == "__main__" guard is used to prevent code at the module scope from being re-executed in the child processes.
This isn't necessary on Linux because it does have fork(), which allows it to fork a child process that maintains the same state of the parent, without re-importing the __main__ module.
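For completeness, a minimal guarded example of the kind the docs describe (a sketch; square is just a placeholder workload):

from joblib import Parallel, delayed

def square(x):
    return x ** 2

if __name__ == '__main__':
    # On Windows each spawned worker re-imports this module; only the
    # imports and definitions above run in the children, not this block.
    results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))
    print(results)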
In case someone stumbles across this in 2021:
Due to the new backend "loky" used by joblib > 0.12, protecting the main for loop is no longer required. See https://joblib.readthedocs.io/en/latest/parallel.html
I'm working on a larger size Python application which runs a solver on several hundred distinct problem scenarios. There is a GUI that allows the user to set up the solver configuration. In an attempt to speed this up I have implemented a multiprocessing Pool to spawn new instances of the solver module within the application.
What ends up happening is that during the pool's creation four new copies of the GUI appear, which is entirely not what I'm looking to have happen. I've taken what I thought were the appropriate steps in protecting the entry point of the application as per the programming guidelines but perhaps I've misunderstood something fundamental about the multiprocessing module.
I've followed the guideline in this thread in creating a minimal startup module.
ScenarioSolver.solveOneScenario creates a new instance of the solver and scenarios_to_solve is a list of arguments.
process_pool = multiprocessing.Pool(4)
for _, result in enumerate(process_pool.imap_unordered(ScenarioSolver.solveOneScenario, scenarios_to_solve)):
    self.processResult(result)
So, based on the limited information here, what might I have overlooked in using the Pool?
EDIT: This behaviour only happens when I package the application into an executable using py2exe. When running from Eclipse I get the intended behaviour.
This is the same problem as solved in this thread.
Adding multiprocessing.freeze_support() immediately after if __name__ == '__main__' solved this problem.
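For reference, a sketch of where that call goes (main is a placeholder for whatever launches the GUI / solver application):

import multiprocessing

def main():
    ...  # launch the GUI and solver here

if __name__ == '__main__':
    # Needed for frozen Windows executables (py2exe, PyInstaller, ...);
    # it should come immediately after the __main__ guard.
    multiprocessing.freeze_support()
    main()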