So I have a problem. I'm trying to make my imports faster, so I started using the multiprocessing module to split a group of imports into two functions and then run each on a separate core, thereby speeding the imports up. But now the code will not recognize the modules at all. What am I doing wrong?
import multiprocessing

def core1():
    import wikipedia
    import subprocess
    import random
    return wikipedia, subprocess, random

def core2():
    from urllib import request
    import json
    import webbrowser
    return request, json, webbrowser
if __name__ == "__main__":
    start_core_1 = multiprocessing.Process(name='worker 1', target=core1, args=core2())
    start_core_2 = multiprocessing.Process(name='worker 2', target=core2, args=core1())
    start_core_1.start()
    start_core_2.start()
    while True:
        user = input('[!] ')
        with request.urlopen('https://api.wit.ai/message?v=20160511&q=%s&access_token=Z55PIVTSSFOETKSBPWMNPE6YL6HVK4YP' % request.quote(user)) as wit_api:  # call to wit.ai api
            wit_api_html = wit_api.read()
            wit_api_html = wit_api_html.decode()
            wit_api_data = json.loads(wit_api_html)
        intent = wit_api_data['entities']['Intent'][0]['value']
        term = wit_api_data['entities']['search_term'][0]['value']
        if intent == 'info_on':
            with request.urlopen('https://kgsearch.googleapis.com/v1/entities:search?query=%s&key=AIzaSyCvgNV4G7mbnu01xai0f0k9NL2ito8vY6s&limit=1&indent=True' % term.replace(' ', '%20')) as response:
                google_knowledge_base_html = response.read()
                google_knowledge_base_html = google_knowledge_base_html.decode()
                google_knowledge_base_data = json.loads(google_knowledge_base_html)
                print(google_knowledge_base_data['itemListElement'][0]['result']['detailedDescription']['articleBody'])
        else:
            print('Something')
I think you are missing the important parts of the whole picture, i.e. the crucial things you need to know about multiprocessing before using it.
Here are those crucial parts; once you know them, you will understand why you can't just import modules in a child process to speed things up, and why even returning the loaded modules is not a good answer.
First, when you use multiprocessing.Process, a child process is forked (on Linux) or spawned (on Windows). I'll assume you are using Linux. In that case, every child process inherits every loaded module from the parent (its global state). When the child process changes anything, such as global variables, or imports new modules, those changes stay in its own context, and the parent process is not aware of them.
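You can see this directly. A minimal sketch (assuming the fork start method, the Linux default): an import done in the child never shows up in the parent's sys.modules.

# Minimal sketch, assuming the fork start method (Linux default):
# an import performed in the child stays in the child.
import multiprocessing
import sys

def child():
    import webbrowser  # imported only in the child's copy of the interpreter state
    print('child sees webbrowser:', 'webbrowser' in sys.modules)   # True

if __name__ == '__main__':
    p = multiprocessing.Process(target=child)
    p.start()
    p.join()
    print('parent sees webbrowser:', 'webbrowser' in sys.modules)  # False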
Second, a module can be a set of classes, external library bindings, functions, and so on, and some of those quite probably can't be pickled, at least with pickle (there are lists of what can be pickled in Python 2.7 and in Python 3.x, and there are even libraries that give you 'more pickling power', like dill). However, I'm not sure pickling whole modules is a good idea at all, not to mention that you have slow imports and yet you want to serialize them and send them to the parent process. Even if you manage to do it, it doesn't sound like the best approach.
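You can check that quickly (a minimal probe, nothing more): the stdlib pickle refuses module objects outright.

# Minimal probe: module objects are not picklable with the stdlib pickle.
import pickle
import json

try:
    pickle.dumps(json)
except TypeError as e:
    print('cannot pickle a module:', e)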
Some ideas on how to change the perspective:
Revise which modules you need and why. Maybe you can use other modules that give you similar functionality; maybe these modules are too heavy, bring too much along with them, and cost a lot compared with what you get.
If module loading is slow, try making a script that is always running, so you do not have to pay the import cost multiple times.
If you really need those modules, maybe you can split their use across two processes, where each process does its own thing. For example, one process parses a page while the other processes the results, and so on. That way the loading overlaps with useful work, but you have to deal with passing messages between processes; a rough sketch follows.
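A rough sketch of that last idea (the worker functions and page names here are made up for illustration): one process produces parsed pages, the other consumes them through a Queue.

# Rough sketch: split work between two processes connected by a Queue.
# parse_pages/process_pages and the page names are hypothetical placeholders.
import multiprocessing

def parse_pages(queue):
    for page in ['page1', 'page2']:   # stand-in for real parsing work
        queue.put(page.upper())
    queue.put(None)                   # sentinel: no more work

def process_pages(queue):
    while True:
        item = queue.get()
        if item is None:
            break
        print('processed', item)

if __name__ == '__main__':
    q = multiprocessing.Queue()
    producer = multiprocessing.Process(target=parse_pages, args=(q,))
    consumer = multiprocessing.Process(target=process_pages, args=(q,))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()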
Related
Is it possible to reuse a Python module already loaded in memory?
Let's say I have scripts loader.py and consume.py. I am trying to do the following: invoke loader.py, then reuse it in consume.py. The first script should load a big file into memory; the second one will be invoked many times and use that big file.
Can I achieve this? I am not familiar with Python, but I guess there should be a way to access a loaded module (script) in memory.
My current implementation attempt looks like this:
loader.py
x = 3
print 'module loaded'
consume.py
from loader import x
print x
Update
I have tried to use importlib as described here and here, but my loader module loads again every time. Below is my code for consume.py:
import importlib

module = importlib.import_module('loader')
globals().update(
    {n: getattr(module, n) for n in module.__all__} if hasattr(module, '__all__')
    else {k: v for (k, v) in module.__dict__.items() if not k.startswith('_')}
)
print(x)
Final goal
Invoke the consume script many times from Node.js without loading the big file every time. I need to share data between script executions.
Define a function in consume.py that does the work you want to do. In fact, it should all be functions. You could have three files, one where you define functions that load the data, one where you define functions that consume data, and one where you combine them into some process.
For example, one module loads data:
# loader.py
def load_data():
    # load the data
    ...
One module where you write functions that consume data:
# consume.py
def consume_data(data):
    # do stuff with the data
    ...

def consume_data_differently(data):
    # do other stuff with the data
    ...
and a script that actually does stuff:
# do_stuff.py
from loader import load_data
from consume import consume_data

data = load_data()
for d in data:  # consume pieces of data in a loop
    consume_data(d)
Setting things up like this gives you much more flexibility than relying on the import mechanism to run code, which isn't what it's designed for.
Addendum based on your update: you're making things much harder than they need to be. You really, really don't need to play around with importlib and globals() in normal code. Those are tools for building libraries, not doing data analysis.
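If the final goal really is to call this from Node.js many times without reloading the big file, a hedged sketch of the earlier 'keep one process running' idea: load once, then answer line-based requests over stdin (the file name and protocol here are invented for illustration).

# server.py -- hedged sketch: keep the big file in one long-lived process and
# answer line-based requests over stdin/stdout. The file name and the
# one-request-per-line protocol are invented for illustration.
import sys

with open('big_file.txt') as f:   # hypothetical big file; loaded exactly once
    data = f.read()

for line in sys.stdin:            # e.g. Node.js writes one request per line
    query = line.strip()
    print(data.count(query))      # stand-in for real work against the loaded data
    sys.stdout.flush()            # flush so the caller sees the reply immediately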
As we all know, we need to protect the main() when running code with multiprocessing in Python, using if __name__ == '__main__'.
I understand that this is necessary in some cases to give access to functions defined in the main module, but I do not understand why it is necessary in this case:
file2.py
import numpy as np
from multiprocessing import Pool

class Something(object):
    def get_image(self):
        return np.random.rand(64, 64)

    def mp(self):
        image = self.get_image()
        p = Pool(2)
        res1 = p.apply_async(np.sum, (image,))
        res2 = p.apply_async(np.mean, (image,))
        print(res1.get())
        print(res2.get())
        p.close()
        p.join()
main.py
from file2 import Something
s = Something()
s.mp()
All of the functions and imports necessary for Something to work are part of file2.py. Why does the subprocess need to re-run main.py?
I think the __name__ solution is not very nice, as it prevents me from distributing the code of file2.py: I can't make sure that users protect their main.
Isn't there a workaround for Windows?
How do packages solve this? I never encountered any problem from not protecting my main with any package; are they just not using multiprocessing?
edit:
I know that this is because fork() is not implemented on Windows. I was just asking whether there is a hack to let the interpreter start at file2.py instead of main.py, as I can be sure that file2.py is self-sufficient.
When using the "spawn" start method, new processes are Python interpreters that are started from scratch. It's not possible for the new Python interpreters in the subprocesses to figure out what modules need to be imported, so they import the main module again, which in turn will import everything else. This means it must be possible to import the main module without any side effects.
If you are on a different platform than Windows, you can use the "fork" start method instead, and you won't have this problem.
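For example (a minimal sketch; "fork" is simply unavailable on Windows):

# Minimal sketch: request the "fork" start method explicitly (Unix only).
import multiprocessing

def work():
    print('the child inherits everything the parent had already imported')

if __name__ == '__main__':
    multiprocessing.set_start_method('fork')  # raises ValueError on Windows
    p = multiprocessing.Process(target=work)
    p.start()
    p.join()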
That said, what's wrong with using if __name__ == "__main__":? It has a lot of additional benefits: documentation tools will be able to process your main module, unit testing is easier, and so on. You should use it in any case.
As others have mentioned, the spawn start method on Windows will re-import the code in each new instance of the interpreter. This import will execute your code again in the child process (which would make it create its own child, and so on, if the main module were unprotected).
A workaround is to pull the multiprocessing script into a separate file and then use subprocess to launch it from the main script.
I pass variables into the script by pickling them in a temporary directory, and I pass the temporary directory into the subprocess with argparse.
I then pickle the results into the temporary directory, where the main script retrieves them.
Here is an example file_hasher() function that I wrote:
main_program.py
import os, pickle, shutil, subprocess, sys, tempfile

def file_hasher(filenames):
    try:
        subprocess_directory = tempfile.mkdtemp()
        input_arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
        with open(input_arguments_file, 'wb') as func_inputs:
            pickle.dump(filenames, func_inputs)
        current_path = os.path.dirname(os.path.realpath(__file__))
        file_hasher = os.path.join(current_path, 'file_hasher.py')
        python_interpreter = sys.executable
        proc = subprocess.call([python_interpreter, file_hasher, subprocess_directory],
                               timeout=60,
                               )
        output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
        with open(output_file, 'rb') as func_outputs:
            hashlist = pickle.load(func_outputs)
    finally:
        shutil.rmtree(subprocess_directory)
    return hashlist
file_hasher.py
#! /usr/bin/env python
import argparse, hashlib, os, pickle
from multiprocessing import Pool

def file_hasher(input_file):
    with open(input_file, 'rb') as f:
        data = f.read()
        md5_hash = hashlib.md5(data)
        hashval = md5_hash.hexdigest()
    return hashval

if __name__ == '__main__':
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('subprocess_directory', type=str)
    subprocess_directory = argument_parser.parse_args().subprocess_directory
    arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
    with open(arguments_file, 'rb') as func_inputs:
        filenames = pickle.load(func_inputs)
    hashlist = []
    p = Pool()
    for r in p.imap(file_hasher, filenames):
        hashlist.append(r)
    output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
    with open(output_file, 'wb') as func_outputs:
        pickle.dump(hashlist, func_outputs)
There must be a better way...
The main module is imported (but with __name__ != '__main__', because Windows is trying to simulate forking-like behavior on a system that doesn't have forking). multiprocessing has no way to know that you didn't do anything important in your main module, so the import is done "just in case" to create an environment similar to the one in your main process. If it didn't do this, all sorts of stuff that happens by side-effect in main (e.g. imports, configuration calls with persistent side-effects, etc.) might not be properly performed in the child processes.
As such, if they're not protecting their __main__, their code is not multiprocessing-safe (nor is it unittest-safe, import-safe, etc.). The if __name__ == '__main__': protective wrapper should be part of all correct main modules. Go ahead and distribute it, with a note about requiring a multiprocessing-safe main module.
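For reference, the standard protective pattern looks like this (the main() body is just a placeholder):

# The protective wrapper every distributable main module should have.
def main():
    ...  # real work goes here, never at module import level

if __name__ == '__main__':
    main()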
The if __name__ == '__main__' guard is needed on Windows because Windows doesn't have a "fork" option for processes.
On Linux, for example, you can fork the process, so the parent process is copied and the copy becomes the child process (with access to all the code already imported and loaded in the parent process).
Since you can't fork on Windows, Python simply imports, in the child process, all the code that was imported by the parent process. This creates a similar effect, but if you don't use the __name__ trick, the import executes your code again in the child process (which then creates its own child, and so on).
So even in your example, main.py will be imported again (since all the files are imported again); Python can't guess which specific script the child process should import.
FYI, there are other limitations you should be aware of, such as the handling of globals; you can read about them here: https://docs.python.org/2/library/multiprocessing.html#windows
What happens to imported modules' variables when a new process is spawned?
I.e., with:
with concurrent.futures.ProcessPoolExecutor(max_workers=settings.MAX_PROCESSES) as executor:
    for stuff in executor.map(foo, paths):
        ...
where:
def foo(str):
    x = someOtherModule.fooBar()
where fooBar is accessing things declared at the top of someOtherModule:
someOtherModule.py:
myHat = 'green'

def fooBar():
    return myHat
Specifically, I have a module (called Y) that has a py4j gateway initialized at the top, outside of any function. In module X I'm loading several files at once, and the function that sorts through the data after loading uses a function in Y which in turn uses the gateway.
Is this design pythonic?
Should I be importing my Y module after each new process is spawned? Or is there a better way to do this?
On Linux, fork will be used to spawn the child, so anything in the global scope of the parent will also be available in the child, with copy-on-write semantics.
On Windows, anything you import at the module-level in the __main__ module of the parent process will get re-imported in the child.
This means that if you have a parent module (let's call it someModule) like this:
import someOtherModule
import concurrent.futures

def foo(str):
    x = someOtherModule.fooBar()

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=settings.MAX_PROCESSES) as executor:
        for stuff in executor.map(foo, paths):
            # stuff
            ...
And someOtherModule looks like this:
myHat = 'green'

def fooBar():
    return myHat
In this example, someModule is the __main__ module of the script. So, on Linux, the myHat instance you get in the child will be a copy-on-write version of the one in someModule. On Windows, each child process will re-import someModule as soon as they load, which will result in someOtherModule being re-imported as well.
I don't know enough about py4j Gateway objects to tell you for sure whether this is the behavior you want. If the Gateway object is pickleable, you could explicitly pass it to each child instead, but you'd have to use a multiprocessing.Pool instead of concurrent.futures.ProcessPoolExecutor:
import someOtherModule
import multiprocessing

def foo(str):
    x = someOtherModule.fooBar()

def init(hat):
    someOtherModule.myHat = hat

if __name__ == "__main__":
    hat = someOtherModule.myHat
    pool = multiprocessing.Pool(settings.MAX_PROCESSES,
                                initializer=init, initargs=(hat,))
    for stuff in pool.map(foo, paths):
        # stuff
        ...
It doesn't seem like you need to do this for your use-case, though. You're probably fine using the re-import.
When you create a new process, fork() is called, which clones the entire process: the stack, the memory space, etc. This is why multiprocessing is considered more expensive than multithreading, since the copying is expensive.
So, to answer your question: all "imported module variables" are cloned. You can modify them as you wish, but your original parent process won't see the change.
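A tiny demonstration of that (a sketch; it assumes a fork-based platform and the someOtherModule from the question):

# Sketch, fork-based platforms only: the child's change to a module-level
# variable is invisible to the parent. someOtherModule is the question's
# module, with myHat = 'green' at the top.
import multiprocessing
import someOtherModule

def child():
    someOtherModule.myHat = 'red'
    print('child:', someOtherModule.myHat)   # red

if __name__ == '__main__':
    p = multiprocessing.Process(target=child)
    p.start()
    p.join()
    print('parent:', someOtherModule.myHat)  # still green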
EDIT:
This applies to Unix-based systems only. See Dano's answer for one covering both Unix and Windows.
I wrote a main Python module that needs to load a file parser in order to work. Initially there was only one text parser module, but I need to add more parsers for different cases:
parser_class1.py
parser_class2.py
parser_class3.py
Only one is required per running instance, so I'm thinking of selecting it on the command line:
mmain.py -p parser_class1
With this purpose in mind, I wrote the following code to select the parser to load when the main module is called:
#!/usr/bin/env python
import argparse

aparser = argparse.ArgumentParser()
aparser.add_argument('-p',
                     action='store',
                     dest='module',
                     help='-p module to import')
results = aparser.parse_args()

if not results.module:
    aparser.error('Error! no module')

try:
    exec("import %s" % (results.module))
    print '%s imported done!' % (results.module)
except ImportError, e:
    print e
But I have been reading that this way is dangerous and maybe not standard.
So, is this approach OK, or must I find another way to do it?
Why?
Thanks; any comments are welcome.
You could actually just execute the import statement inside a conditional block:
if x:
    import module1a as module1
else:
    import module1b as module1
You can account for various whitelisted module imports in different ways using this, but effectively the idea is to pre-program the imports and then branch to the proper one, GOTO-style. If you do want to let the user import any arbitrary argument, then the __import__ function would be the way to go, rather than eval.
Update:
As #thedox mentioned in the comment, the as module1 section is the idiomatic way for loading similar APIs with different underlying code.
In the case where you intend to do completely different things with entirely different APIs, that's not the pattern to follow.
A more reasonable pattern in this case would be to include the code related to a particular import with that import statement:
if ...:
    import module1
    # do some stuff with module1 ...
else:
    import module2
    # do some stuff with module2 ...
As for security, if you allow the user to cause an import of some arbitrary code-set (e.g. their own module, perhaps?), it's not much different than using eval on user-input. It's essentially the same vulnerability: the user can get your program to execute their own code.
I don't think there's a truly safe way to let the user import arbitrary modules at all. The exception is if they have no access to the file-system and therefore cannot create new code to be imported, in which case you're basically back to the whitelist case, and you may as well implement an explicit whitelist to prevent vulnerabilities if/when the user does gain file-system access.
Here is how to use __import__():
# Note: whitelist entries are module names, not file names.
allowed_modules = ['os', 're', 'your_module', 'parser_class1', 'parser_class2']

if not results.module:
    aparser.error('Error! no module')

try:
    if results.module in allowed_modules:
        module = __import__(results.module)
        print '%s imported as "module"' % (results.module)
    else:
        print 'hey what are you trying to do?'
except ImportError, e:
    print e

module.your_function(your_data)
EVAL vs __IMPORT__()
Using eval allows the user to run any code on your computer. Don't do that. __import__() only allows the user to load modules, apparently without letting them run arbitrary code. But it is only apparently safer.
The proposed function without allowed_modules is still risky, since it can load an arbitrary module that may run malicious code when loaded. Potentially, the attacker can place a file somewhere (a shared folder, an FTP folder, an upload folder managed by your webserver...) and point your argument at it.
WHITELISTS
Using allowed_modules mitigates the problem but does not solve it completely: to harden things further, you still have to check whether the attacker has written an "os.py", "re.py", "your_module.py", or "parser_class1.py" into your script folder, since Python searches for modules there first (docs).
You could also compare the parser_class*.py code against a list of hashes, the way sha1sum does; a sketch of that idea follows.
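# Sketch of the hash check: refuse to import a parser whose file content
# does not match a known digest. The digest value below is a placeholder.
import hashlib

def file_sha1(path):
    with open(path, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest()

known_hashes = {'parser_class1.py': '0000000000000000000000000000000000000000'}

path = 'parser_class1.py'
if file_sha1(path) == known_hashes.get(path):
    module = __import__('parser_class1')
else:
    raise ImportError('hash mismatch for %s' % path)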
FINAL REMARKS: In the end, if the user has write access to your script folder, you cannot guarantee absolutely safe code.
You should think of all of the possible modules you may import for that parsing function and then use a case statement or dictionary to load the correct one. For example:
import parser_class1, parser_class2, parser_class3

parser_map = {
    'class1': parser_class1,
    'class2': parser_class2,
    'class3': parser_class3,
}

if not args.module:
    # report error
    parser = None
else:
    parser = parser_map[args.module]

# perform work with parser
If loading any of the parser_classN modules in this example is expensive, you can define functions that import and return the module, and alter the lookup to call the mapped function, as sketched below.
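# Lazy variant (sketch): defer each expensive import until that parser is chosen.
def get_class1():
    import parser_class1
    return parser_class1

def get_class2():
    import parser_class2
    return parser_class2

parser_map = {
    'class1': get_class1,
    'class2': get_class2,
}

parser = parser_map[args.module]()  # note the call: the import happens here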
The exec option could be very dangerous because you're executing unvalidated user input. Imagine your user did something like this:
mmain.py -p "parser_class1; some_function_or_code_that_is_malicious()"
I'm working on pypreprocessor, a preprocessor that takes C-style directives. I've been able to make it work like a traditional preprocessor (it's self-consuming and executes postprocessed code on-the-fly), except that it breaks library imports.
The problem is: the preprocessor runs through the file, processes it, outputs to a temporary file, and exec()s the temporary file. Libraries that are imported need to be handled a little differently, because they aren't executed; rather, they are loaded and made accessible to the caller module.
What I need to be able to do is: Interrupt the import (since the preprocessor is being run in the middle of the import), load the postprocessed code as a tempModule, and replace the original import with the tempModule to trick the calling script with the import into believing that the tempModule is the original module.
I have searched everywhere and so far have no solution.
This Stack Overflow question is the closest I've seen so far to providing an answer:
Override namespace in Python
Here's what I have.
# Remove the bytecode file created by the first import
os.remove(moduleName + '.pyc')
# Remove the first import
del sys.modules[moduleName]
# Import the postprocessed module
tmpModule = __import__(tmpModuleName)
# Set first module's reference to point to the preprocessed module
sys.modules[moduleName] = tmpModule
moduleName is the name of the original module, and tmpModuleName is the name of the postprocessed code file.
The strange part is that this solution still runs completely normally, as if the first module had loaded normally; but if you remove the last line, you get a module-not-found error.
Hopefully someone on Stack Overflow knows a lot more about imports than I do, because this one has me stumped.
Note: I will only award a solution, or, if this is not possible in Python, the best and most detailed explanation of why it is not possible.
Update: For anybody who is interested, here is the working code.
import imp
import sys

if imp.lock_held() is True:
    del sys.modules[moduleName]
    sys.modules[tmpModuleName] = __import__(tmpModuleName)
    sys.modules[moduleName] = __import__(tmpModuleName)
The imp.lock_held() check detects whether the module is being loaded as a library; the following lines do the rest.
Does this answer your question? The second import does the trick.
Mod_1.py
def test_function():
    print "Test Function -- Mod 1"
Mod_2.py
def test_function():
    print "Test Function -- Mod 2"
Test.py
#!/usr/bin/python
import sys
import Mod_1
Mod_1.test_function()
del sys.modules['Mod_1']
sys.modules['Mod_1'] = __import__('Mod_2')
import Mod_1
Mod_1.test_function()
To define a different import behavior or to totally subvert the import process you will need to write import hooks. See PEP 302.
For example,
import sys

class MyImporter(object):
    def find_module(self, module_name, package_path):
        # Return a loader
        return self

    def load_module(self, module_name):
        # Return a module
        return self

sys.meta_path.append(MyImporter())

import now_you_can_import_any_name
print now_you_can_import_any_name
It outputs:
<__main__.MyImporter object at 0x009F85F0>
So basically it returns a new module (which can be any object), in this case itself. You may use it to alter the import behavior by returning processed_xxx on an import of xxx.
IMO: Python doesn't need a preprocessor. Whatever you are trying to accomplish can be accomplished in Python itself thanks to its very dynamic nature. For example, taking the debug case: what is wrong with having, at the top of the file,
debug = 1
and later
if debug:
    print "wow"
?
In Python 2 there is the imputil module that seems to provide the functionality you are looking for, but it has been removed in Python 3. It's not very well documented, but it contains an example section that shows how you can replace the standard import functions.
For Python 3 there is the importlib module (introduced in Python 3.1), which contains functions and classes to modify the import functionality in all kinds of ways. It should be suitable for hooking your preprocessor into the import system; a sketch follows.
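A minimal sketch of such a hook with importlib (the module name and the "preprocessed" source below are placeholders; a real hook would run your preprocessor inside exec_module):

# Minimal importlib sketch: a finder/loader pair that hands back code you
# generate yourself. The module name and source below are placeholders.
import importlib.abc
import importlib.util
import sys

class PreprocessFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    def find_spec(self, fullname, path, target=None):
        if fullname != 'my_preprocessed_module':  # hypothetical module name
            return None
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None  # fall back to the default module creation

    def exec_module(self, module):
        source = 'x = 42'  # stand-in for the preprocessor's output
        exec(source, module.__dict__)

sys.meta_path.insert(0, PreprocessFinder())

import my_preprocessed_module
print(my_preprocessed_module.x)  # 42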