As we all know, we need to protect the main module with if __name__ == '__main__' when running code with multiprocessing in Python.
I understand that this is necessary in some cases to give access to functions defined in the main module, but I do not understand why it is necessary in this case:
file2.py
import numpy as np
from multiprocessing import Pool

class Something(object):
    def get_image(self):
        return np.random.rand(64, 64)

    def mp(self):
        image = self.get_image()
        p = Pool(2)
        res1 = p.apply_async(np.sum, (image,))
        res2 = p.apply_async(np.mean, (image,))
        print(res1.get())
        print(res2.get())
        p.close()
        p.join()
main.py
from file2 import Something
s = Something()
s.mp()
All of the functions and imports necessary for Something to work are part of file2.py. Why does the subprocess need to re-run main.py?
I think the __name__ solution is not very nice, as it prevents me from distributing the code of file2.py: I can't make sure users protect their main module.
Isn't there a workaround for Windows?
How do packages solve this? (I have never encountered any problem from not protecting my main with any package - are they just not using multiprocessing?)
edit:
I know that this is because fork() is not implemented on Windows. I was just asking if there is a hack to let the interpreter start at file2.py instead of main.py, as I can be sure that file2.py is self-sufficient.
When using the "spawn" start method, new processes are Python interpreters that are started from scratch. It's not possible for the new Python interpreters in the subprocesses to figure out what modules need to be imported, so they import the main module again, which in turn will import everything else. This means it must be possible to import the main module without any side effects.
If you are on a different platform than Windows, you can use the "fork" start method instead, and you won't have this problem.
That said, what's wrong with using if __name__ == "__main__":? It has a lot of additional benefits, e.g. documentation tools will be able to process your main module, unit testing is easier, and so on, so you should use it in any case.
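For reference, here is a minimal sketch of a guarded main.py for the example above; the guard is the only addition:
from file2 import Something

if __name__ == '__main__':
    # Only the parent process executes this block; the re-imported copy in each
    # worker skips it because its __name__ is not '__main__'.
    s = Something()
    s.mp()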
As others have mentioned, the "spawn" start method used on Windows will re-import the code in each new instance of the interpreter. This import will execute your code again in the child process (and, without the guard, that child would create its own child, and so on).
A workaround is to pull the multiprocessing script into a separate file and then use subprocess to launch it from the main script.
I pass variables into the script by pickling them in a temporary directory, and I pass the temporary directory into the subprocess with argparse.
I then pickle the results into the temporary directory, where the main script retrieves them.
Here is an example file_hasher() function that I wrote:
main_program.py
import os, pickle, shutil, subprocess, sys, tempfile

def file_hasher(filenames):
    try:
        subprocess_directory = tempfile.mkdtemp()
        input_arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
        with open(input_arguments_file, 'wb') as func_inputs:
            pickle.dump(filenames, func_inputs)
        current_path = os.path.dirname(os.path.realpath(__file__))
        file_hasher_script = os.path.join(current_path, 'file_hasher.py')
        python_interpreter = sys.executable
        subprocess.call([python_interpreter, file_hasher_script, subprocess_directory],
                        timeout=60,
                        )
        output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
        with open(output_file, 'rb') as func_outputs:
            hashlist = pickle.load(func_outputs)
    finally:
        shutil.rmtree(subprocess_directory)
    return hashlist
file_hasher.py
#! /usr/bin/env python
import argparse, hashlib, os, pickle
from multiprocessing import Pool

def file_hasher(input_file):
    with open(input_file, 'rb') as f:
        data = f.read()
        md5_hash = hashlib.md5(data)
        hashval = md5_hash.hexdigest()
    return hashval

if __name__ == '__main__':
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('subprocess_directory', type=str)
    subprocess_directory = argument_parser.parse_args().subprocess_directory

    arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
    with open(arguments_file, 'rb') as func_inputs:
        filenames = pickle.load(func_inputs)

    hashlist = []
    p = Pool()
    for r in p.imap(file_hasher, filenames):
        hashlist.append(r)

    output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
    with open(output_file, 'wb') as func_outputs:
        pickle.dump(hashlist, func_outputs)
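A hypothetical call from main_program.py (the file names are just placeholders) then needs no __main__ guard of its own:
hashes = file_hasher(['first_file.dat', 'second_file.dat'])  # placeholder file names
print(hashes)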
There must be a better way...
The main module is imported (but with __name__ != '__main__', because Windows is trying to simulate forking-like behavior on a system that doesn't have forking). multiprocessing has no way to know that you didn't do anything important in your main module, so the import is done "just in case" to create an environment similar to the one in your main process. If it didn't do this, all sorts of stuff that happens by side effect in main (e.g. imports, configuration calls with persistent side effects, etc.) might not be properly performed in the child processes.
As such, if they're not protecting their __main__, the code is not multiprocessing safe (nor is it unittest safe, import safe, etc.). The if __name__ == '__main__': protective wrapper should be part of all correct main modules. Go ahead and distribute it, with a note about requiring multiprocessing-safe main module protection.
The if __name__ == '__main__' guard is needed on Windows since Windows doesn't have a "fork" option for processes.
On Linux, for example, you can fork the process, so the parent process is copied and the copy becomes the child process (and it has access to the already imported code you had loaded in the parent process).
Since you can't fork on Windows, Python simply imports, in the child process, all the code that was imported by the parent process. This creates a similar effect, but if you don't use the __name__ trick, this import will execute your code again in the child process (and that child will create its own child, and so on).
So even in your example, main.py will be imported again (since all the files are imported again). Python can't guess which specific Python script the child process should import.
FYI, there are other limitations you should be aware of, like using globals; you can read about them here: https://docs.python.org/2/library/multiprocessing.html#windows
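As a hypothetical illustration of that re-import (the file name demo.py is made up), a module-level print runs once per process when the "spawn" start method is used:
# demo.py -- hypothetical example
from multiprocessing import Process
import os

print('module-level code running in process', os.getpid())

def work():
    pass

if __name__ == '__main__':
    p = Process(target=work)
    p.start()
    p.join()
    # Under "spawn" (the default on Windows), the child re-imports this file,
    # so the print above appears twice; under "fork" it appears only once.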
Related
I have a python library/package that contains modules with functions & classes I usually import inside of my application's main script. However, these modules each contain their own 'main()' function so that I can run them individually if needed. Here's an example:
# my_module.py
import logging
log = logging.getLogger(__name__)

class SomeModuleClass:
    def __init__(self):
        pass

def some_module_function():
    pass

def main():
    # Some code when running the module as the main script
    pass

if __name__ == "__main__":
    main()
I have two questions related to running this module as the main script (i.e. running python my_module.py):
What's the cleanest way to use the logging module in this case? I usually simply overwrite the global 'log' variable by including the following in the main() function, but I'm not sure if there's a more pythonic way of doing this:
def main():
    # ... some logging setup
    logging.config.dictConfig(log_cfg_dictionary)
    global log
    log = logging.getLogger(__name__)
Is it considered good practice to include import statements inside of the main() function for all libraries only needed when running the script as main? For instance, one might need to use argparse, json, logging.config, etc. inside of the main function, but these libraries are not used anywhere else in my module. I know the most pythonic way is probably to keep all the import statements at the top of the file (especially if there's no significant difference in terms of memory usage or performance), but keeping main function-specific imports inside of main() remains readable & looks cleaner in my opinion.
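For illustration, the main-only-imports pattern described above might look like the sketch below (log_cfg_dictionary and the command-line argument are hypothetical placeholders):
def main():
    # Imports only needed when running as a script (hypothetical selection).
    import argparse
    import logging.config

    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('--log-config', default=None)
    args = argument_parser.parse_args()

    logging.config.dictConfig(log_cfg_dictionary)  # assumed to be defined elsewhere
    global log
    log = logging.getLogger(__name__)
    log.info('running as a script, log config: %s', args.log_config)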
So I have a problem. I'm trying to make my imports faster, so I started using the multiprocessing module to split a group of imports into two functions and then run each on a separate core, thus speeding up the imports. But now the code will not recognize the modules at all. What am I doing wrong?
import multiprocessing

def core1():
    import wikipedia
    import subprocess
    import random
    return wikipedia, subprocess, random

def core2():
    from urllib import request
    import json
    import webbrowser
    return request, json, webbrowser

if __name__ == "__main__":
    start_core_1 = multiprocessing.Process(name='worker 1', target=core1, args=core2())
    start_core_2 = multiprocessing.Process(name='worker 2', target=core2, args=core1())
    start_core_1.start()
    start_core_2.start()

    while True:
        user = input('[!] ')
        with request.urlopen('https://api.wit.ai/message?v=20160511&q=%s&access_token=Z55PIVTSSFOETKSBPWMNPE6YL6HVK4YP' % request.quote(user)) as wit_api:  # call to wit.ai api
            wit_api_html = wit_api.read()
            wit_api_html = wit_api_html.decode()
            wit_api_data = json.loads(wit_api_html)
        intent = wit_api_data['entities']['Intent'][0]['value']
        term = wit_api_data['entities']['search_term'][0]['value']
        if intent == 'info_on':
            with request.urlopen('https://kgsearch.googleapis.com/v1/entities:search?query=%s&key=AIzaSyCvgNV4G7mbnu01xai0f0k9NL2ito8vY6s&limit=1&indent=True' % term.replace(' ', '%20')) as response:
                google_knowledge_base_html = response.read()
                google_knowledge_base_html = google_knowledge_base_html.decode()
                google_knowledge_base_data = json.loads(google_knowledge_base_html)
                print(google_knowledge_base_data['itemListElement'][0]['result']['detailedDescription']['articleBody'])
        else:
            print('Something')
I think you are missing the important parts of the whole picture, i.e. some crucial things you need to know about multiprocessing when using it.
Here are those crucial parts; once you know them, you will understand why you can't just import modules in a child process to speed things up. Even returning the loaded modules is not a perfect answer.
First, when you use multiprocessing.Process, a child process is forked (on Linux) or spawned (on Windows). I'll assume you are using Linux. In that case, every child process inherits every loaded module from the parent (global state). When a child process changes anything, like global variables or newly imported modules, those changes stay in its own context only. So the parent process is not aware of them. I believe part of this can also be of interest.
Second, a module can be a set of classes, external lib bindings, functions, etc., and some of them quite probably can't be pickled, at least with pickle. Here is the list of what can be pickled in Python 2.7 and in Python 3.X. There are even libraries that give you 'more pickling power', like dill. However, I'm not sure pickling whole modules is a good idea at all, not to mention that you have slow imports and yet you want to serialize them and send them to the parent process. Even if you manage to do it, it doesn't sound like the best approach.
Some ideas on how to change the perspective:
Revise which modules you need and why. Maybe you can use other modules that give you similar functionality. Maybe these modules are heavyweight, bring too much with them, and the cost is high compared to what you get.
If your modules load slowly, try to make a script that is always running, so you do not have to run it multiple times.
If you really need those modules, maybe you can split their use across two processes, and then each process does its own thing. For example, one process parses a page and the other processes the data, and so on. That way you speed up the loading, but you have to deal with passing messages between processes (see the sketch below).
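Here is a minimal sketch of that last idea (the URL and the division of work are made up for illustration): each process does its own imports once, and the two sides exchange data through a queue.
from multiprocessing import Process, Queue

def fetcher(out_q):
    # Slow imports live only in the process that needs them (hypothetical example).
    from urllib import request
    html = request.urlopen('https://example.com').read()
    out_q.put(html)

def consumer(in_q):
    html = in_q.get()
    print(len(html), 'bytes received')

if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=fetcher, args=(q,))
    p2 = Process(target=consumer, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()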
What happens to an imported module's variables when a new process is spawned?
i.e.:
with concurrent.futures.ProcessPoolExecutor(max_workers=settings.MAX_PROCESSES) as executor:
    for stuff in executor.map(foo, paths):
where:
def foo(str):
    x = someOtherModule.fooBar()
where fooBar accesses things declared at the top of someOtherModule:
someOtherModule.py:
myHat = 'green'
def fooBar():
    return myHat
Specifically, I have a module (called Y) that has a py4j gateway initialized at the top, outside of any function. In module X I'm loading several files at once, and the function that sorts through the data after loading uses a function in Y which in turn uses the gateway.
Is this design pythonic?
Should I be importing my Y module after each new process is spawned? Or is there a better way to do this?
On Linux, fork will be used to spawn the child, so anything in the global scope of the parent will also be available in the child, with copy-on-write semantics.
On Windows, anything you import at the module-level in the __main__ module of the parent process will get re-imported in the child.
This means that if you have a parent module (let's call it someModule) like this:
import someOtherModule
import concurrent.futures

def foo(str):
    x = someOtherModule.fooBar()

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=settings.MAX_PROCESSES) as executor:
        for stuff in executor.map(foo, paths):
            pass  # do something with stuff
And someOtherModule looks like this:
myHat = 'green'
def fooBar():
    return myHat
In this example, someModule is the __main__ module of the script. So, on Linux, the myHat instance you get in the child will be a copy-on-write version of the one in someModule. On Windows, each child process will re-import someModule as soon as they load, which will result in someOtherModule being re-imported as well.
I don't know enough about py4j Gateway objects to tell you for sure whether this is the behavior you want. If the Gateway object is picklable, you could explicitly pass it to each child instead, but you'd have to use a multiprocessing.Pool instead of concurrent.futures.ProcessPoolExecutor:
import someOtherModule
import multiprocessing

def foo(str):
    x = someOtherModule.fooBar()

def init(hat):
    someOtherModule.myHat = hat

if __name__ == "__main__":
    hat = someOtherModule.myHat
    pool = multiprocessing.Pool(settings.MAX_PROCESSES,
                                initializer=init, initargs=(hat,))
    for stuff in pool.map(foo, paths):
        pass  # do something with stuff
It doesn't seem like you need to do this for your use case, though. You're probably fine using the re-import.
When you create a new process, fork() is called, which clones the entire process: its stack, memory space, etc. This is why multiprocessing is considered more expensive than multithreading, since the copying is expensive.
So to answer your question, all "imported module variables" are cloned. You can modify them as you wish, but your original parent process won't see this change.
EDIT:
This is for Unix-based systems only. See Dano's answer for a Unix+Windows answer.
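As a small illustration (not from the original answer), a child's change to a module-level variable is invisible to the parent:
from multiprocessing import Process

my_hat = 'green'  # stands in for someOtherModule.myHat

def child():
    global my_hat
    my_hat = 'red'                       # only this process sees the change
    print('child sees:', my_hat)         # red

if __name__ == '__main__':
    p = Process(target=child)
    p.start()
    p.join()
    print('parent still sees:', my_hat)  # green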
I would like to know what is the best practice when you want to "return" something from a python script.
Here is my problem. I'm running a Python childScript from a parentScript using the subprocess.Popen method. I would like to get a tuple of two floats from the execution of the first script.
Now, the first method I have seen is to use sys.stdout and a pipe in the subprocess call, as follows:
child.py:
import sys

if __name__ == '__main__':
    myTuple = (x, y)
    sys.stdout.write(str(myTuple[0]) + ":" + str(myTuple[1]))
    sys.stdout.flush()
parent.py:
p = subprocess.Popen([python, "child.py"], stdout=subprocess.PIPE)
out, err = p.communicate()
Here it says that this is not recommended in most cases, but I don't know why...
The second way would be to write my tuple into a text file in Script1.py and open it in Script2.py. But I guess writing and reading a file takes a bit of time, so I don't know if that is a better way to do it.
Finally, I could use cPickle, dump my tuple, and load it from script2.py. I guess that would be a bit faster than using a text file, but would it be better than using sys.stdout?
What would be the proper way to do?
---------------------------------------EDIT------------------------------------------------
I forgot to mention that I cannot use import, since parent.py actually generates child.py in a folder. Indeed, I am doing some multiprocessing.
Parent.py creates, say, 10 directories, and child.py is copied into each of them. Then I run each of the child.py scripts from parent.py on several processors. And I want parent.py to gather the results "returned" by all the child.py scripts. So parent.py cannot import child.py, since it is not generated yet - or maybe I can do some sort of dynamic import? I don't know...
---------------------------------------EDIT2-----------------------------------------------
Another edit to answer a question about why I proceed this way. Child.py actually calls IronPython and another script to run a .NET assembly. The reason why I HAVE to copy all the child.py files into specific folders is that this assembly generates a resource file which is then used by the assembly itself. If I don't copy child.py (and the assembly, by the way) into each subfolder, the resource files are created at the root, which creates conflicts when I call several processes using the multiprocessing module. If you have some suggestions about this overall architecture, they are more than welcome :).
Thanks
Ordinarily, you should use import other_module and call the various functions:
import other_module
x, y = other_module.some_function(param='z')
If you can run the script, you can also import it.
If you want to use subprocess.Popen(), then to pass a couple of floats you could use the json format: it is human-readable, exact (in this case), and machine-readable. For example:
child.py:
#!/usr/bin/env python
import json
import sys
numbers = 1.2345, 1e-20
json.dump(numbers, sys.stdout)
parent.py:
#!/usr/bin/env python
import json
import sys
from subprocess import check_output
output = check_output([sys.executable, 'child.py'])
x, y = json.loads(output.decode())
Child.py actually calls ironpython and another script to run a .Net assembly. The reason why I HAVE to copy all the child.py files is because this assembly generates a resource file which is then used by it. If I don't copy child.py in each subfolders the resource files are copied at the root which creates conflicts when I call several processes using the multiprocessing module. If you have some suggestions about this overall architecture it is more than welcome :).
You can put the code from child.py into parent.py and call os.chdir() (after the fork) to execute each multiprocessing.Process in its own working directory, or use the cwd parameter (which sets the current working directory for the subprocess) if you run the assembly using the subprocess module:
#!/usr/bin/env python
import os
import shutil
import tempfile
from multiprocessing import Pool

def init(topdir='.'):
    dir = tempfile.mkdtemp(dir=topdir)  # parent is responsible for deleting it
    os.chdir(dir)

def child(n):
    return os.getcwd(), n*n

if __name__ == "__main__":
    pool = Pool(initializer=init)
    results = pool.map(child, [1, 2, 3])
    pool.close()
    pool.join()

    for dirname, _ in results:
        try:
            shutil.rmtree(dirname)
        except EnvironmentError:
            pass  # ignore errors
I have successfully created a Python Windows service using pywin32. In testing my application I attempted to have it print (which didn't work as I expected), and I also had it write to a file. It was able to write to a file, but the file ended up in the Python library's site-packages folder. This appears to be where the working directory is, though I'm not sure why. I would like to know the best way to specify what the working directory should be.
I could open files with full path names, or I could maybe use os.getcwd()? What is the best practice?
Here are the two files which compose my Windows service.
import os
import sys
import win32service
import win32serviceutil
from twisted.internet import reactor
import xpress

class XPressService(win32serviceutil.ServiceFramework):
    _svc_name_ = 'XPress'
    _svc_display_name_ = 'XPress Longer Name'

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        reactor.callFromThread(reactor.stop)

    def SvcDoRun(self):
        xpress.main()
        reactor.run(installSignalHandlers=False)

if __name__ == "__main__":
    win32serviceutil.HandleCommandLine(XPressService)
Below is "xpress.py" which is imported by the above script.
import datetime

def main():
    with open('times', 'a') as f:
        print str(datetime.datetime.now())
        f.write(str(datetime.datetime.now()))

if __name__ == '__main__':
    main()
Both work; it depends on what your needs are. For various reasons, it's probably best to use absolute paths to the file names. This way you don't have to worry about 'where' your app is working, you just know where the output will be (which is most important). On *nix, apps generally work in '/' when they don't have a specified working directory. If you do choose to work in another directory, use os.chdir(newDir), and do this before you call win32serviceutil.HandleCommandLine.
I don't know the Windows default, but you probably nailed it with the library's directory in site-packages.
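For instance, here is a sketch of the absolute-path suggestion applied to the xpress.py module above (the file name 'times' is kept from the question):
import datetime
import os

# Write next to this module instead of relying on the service's working directory (illustrative).
OUTPUT_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'times')

def main():
    with open(OUTPUT_PATH, 'a') as f:
        f.write(str(datetime.datetime.now()) + '\n')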