Analyse Python project imports

In general, I would like to understand exactly what code my projects are actually using from a big framework.
First I want to know what all the imports are (possibly with static analysis), and then, if possible, which of those imports are actually used.
For the first problem I could of course use a regexp, but I would like to find a cleaner way, and I don't see how to do it with ast/inspect/parser.
As for the second problem, I should be able to find out automatically whether some of the imports are unused, but how can I do that?
EDIT:
about the second issue, maybe the best way is a simple import hook which just records everything that gets imported and then calls the default import mechanism.
So I tried something like:
import imp

class MyLoader(object):
    """
    Loader object
    """
    def __init__(self):
        self.loaded = set()

    def find_module(self, module_name, package=None):
        print("requesting %s" % module_name)
        self.loaded.add(module_name)
        return self

    def load_module(self, fullname):
        fp, pathname, stuff = imp.find_module(fullname)
        imp.load_module(fullname, fp, pathname, stuff)
But trying to import "random" I get
from __future__ import division
ImportError: No module named __future__
which I think means I'm missing something...
I haven't found any simple example of using imp to do some import introspection; any hints?

I'm pleased to say that listing out the imports is actually quite simple.
I need a minimal implementation of the Importer protocol (defined by PEP 302): if find_module returns None, the machinery just falls back to the next finder.
This simple script can actually show the imports done by the program passed in:
import sys

class ImportInspector(object):
    def find_module(self, module, path):
        # returning None (implicitly) falls back to the normal import machinery
        print("importing module %s" % module)

if __name__ == '__main__':
    # shift the arguments by one position so the target script
    # sees its own name in sys.argv[0]
    sys.argv = sys.argv[1:]
    progname = sys.argv[0]
    sys.meta_path.append(ImportInspector())
    code = compile(open(progname, 'rb').read(), progname, 'exec')
    exec(code)
Given this, any kind of trick can be implemented on top of it.
For example we can keep track of the imports in a set and store them all when the program quits.
I think we might even get the hierarchy of the imports and produce a graph similar to what gprof2dot does, but based only on the analysis of the imports.
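A minimal sketch of that set-based bookkeeping, using the same legacy PEP 302 find_module hook as above (note that find_module is deprecated in favour of find_spec on Python 3.4+ and was removed in 3.12):

import atexit
import sys

class ImportRecorder(object):
    """Records every requested module name, then returns None so the
    default import machinery does the actual work."""
    def __init__(self):
        self.seen = set()

    def find_module(self, fullname, path=None):
        self.seen.add(fullname)
        return None  # fall back to the normal finders

recorder = ImportRecorder()
sys.meta_path.insert(0, recorder)
# dump everything that was requested when the program quits
atexit.register(lambda: print(sorted(recorder.seen)))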

The problem with such an analysis is the dynamic nature of Python: the set of modules that are used may depend on runtime state (i.e. some modules could be imported and used only under certain runtime conditions).
Maybe not the best way, but if you have pretty decent test coverage for your code, you can use coverage.py output to check which modules were loaded during the test execution.
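For example, a short sketch of reading the collected data with the coverage API (assuming a .coverage data file was produced by a previous coverage run of the test suite):

import coverage

cov = coverage.Coverage()
cov.load()  # reads the .coverage file written by 'coverage run'
data = cov.get_data()
# every source file that was executed during the tests:
for path in sorted(data.measured_files()):
    print(path)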

Related

How to load a dylib-file as CPython-extension?

This has been asked before (e.g. here), but the given solution (i.e. renaming the file to *.so) is not acceptable. I have a CPython extension called name.dylib which cannot be imported. If the filename is changed to name.so, it is imported correctly. Changing the filename is not an option**, and should not be necessary.
Python has a lot of hooks for searching for modules, so there must be a way to make it recognise a dylib file. Can someone show how to do this? Using a low-level import which spells out the whole filename is not nice, but it is an acceptable solution.
** because the build code forces dylib, and other contexts I have assume it. The extension module is dual-purpose: it can be used both as an ordinary shared library and as a Python extension. Using a symlink does work, but it is a last resort because it requires manual intervention in an automated process.
You could manipulate sys.path_hooks and replace the FileFinder hook with one which accepts the .dylib extension as well. But see also the simpler but less convenient alternative further down, which imports the extension given its full file name.
More information on how .so, .py and .pyc files are imported can be found, for example, in this answer of mine.
The manipulation could look like the following:
import sys
import importlib
from importlib.machinery import FileFinder, ExtensionFileLoader
# pick right loader for .dylib-files:
dylib_extension = ExtensionFileLoader, ['.dylib']
# add dylib-support to file-extension supported per default
all_supported_loaders = [dylib_extension]+ importlib._bootstrap_external._get_supported_file_loaders()
# replace the last hook (i.e. FileFinder) with one recognizing `.dylib` as well:
sys.path_hooks.pop()
sys.path_hooks.append(FileFinder.path_hook(*all_supported_loaders))
#and now import name.dylib via
import name
This must be the first code executed when the Python script starts to run. Other modules might not expect sys.path_hooks to have been manipulated during the run of the program, so there might be some problems with other modules (like pdb, traceback, and so on). For example:
import pdb
#above code
import name
will fail, while
#above code
import pdb
import name
will work, as pdb seems to manipulate the import machinery.
Normally, the FileFinder hook is the last one in sys.path_hooks, because it is the last resort: once path_hook_for_FileFinder is called for a path, a finder is returned unconditionally (a hook should raise ImportError if PathFinder is supposed to look at further hooks):
def path_hook_for_FileFinder(path):
    """Path hook for importlib.machinery.FileFinder."""
    if not _path_isdir(path):
        raise ImportError('only directories are supported', path=path)
    return cls(path, *loader_details)  # HERE the finder is returned!
However, one might want to be sure and check that really the right hook is being replaced.
A simpler alternative would be to use imp.load_dynamic (neglecting for the moment that imp is deprecated):
import imp
imp.load_dynamic('name', 'name.dylib') # or what ever path is used
That might be more robust than the first solution (no problems with pdb, for example) but less convenient for bigger projects.
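If the deprecated imp module is to be avoided entirely, a rough importlib-based equivalent could look like this (a sketch; the path is the example name from the question, and ExtensionFileLoader accepts any file name, so no renaming is needed):

import sys
from importlib.machinery import ExtensionFileLoader
from importlib.util import module_from_spec, spec_from_file_location

path = './name.dylib'  # wherever the extension actually lives
loader = ExtensionFileLoader('name', path)
spec = spec_from_file_location('name', path, loader=loader)
module = module_from_spec(spec)  # calls the loader's create_module
loader.exec_module(module)       # runs the extension's init code
sys.modules['name'] = module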

How to modify imported source code on-the-fly?

Suppose I have a module file like this:
# my_module.py
print("hello")
Then I have a simple script:
# my_script.py
import my_module
This will print "hello".
Let's say I want to "override" the print() function so it returns "world" instead. How could I do this programmatically (without manually modifying my_module.py)?
What I thought is that I somehow need to modify the source code of my_module before or while importing it. Obviously, I cannot do this after importing it, so solutions using unittest.mock are out.
I also thought I could read the file my_module.py, perform the modification, then load it. But this is ugly, as it will not work if the module is located somewhere else.
The good solution, I think, is to make use of importlib.
I read the docs and found a very interesting method: get_source(fullname). I thought I could just override it:
def get_source(fullname):
    source = super().get_source(fullname)
    source = source.replace("hello", "world")
    return source
Unfortunately, I am a bit lost with all these abstract classes and I do not know how to perform this properly.
I tried vainly:
spec = importlib.util.find_spec("my_module")
spec.loader.get_source = mocked_get_source
module = importlib.util.module_from_spec(spec)
Any help would be welcome, please.
Here's a solution based on the content of this great talk. It allows any arbitrary modifications to be made to the source before importing the specified module. It should be reasonably correct as long as the slides did not omit anything important. This will only work on Python 3.5+.
import importlib.util
import sys

def modify_and_import(module_name, package, modification_func):
    spec = importlib.util.find_spec(module_name, package)
    source = spec.loader.get_source(module_name)
    new_source = modification_func(source)
    module = importlib.util.module_from_spec(spec)
    codeobj = compile(new_source, module.__spec__.origin, 'exec')
    exec(codeobj, module.__dict__)
    sys.modules[module_name] = module
    return module
So, using this you can do
my_module = modify_and_import("my_module", None, lambda src: src.replace("hello", "world"))
This doesn't answer the general question of dynamically modifying the source code of an imported module, but overriding or "monkey-patching" its use of the print() function can be done (since print is a built-in function in Python 3.x). Here's how:
#!/usr/bin/env python3
# my_script.py
import builtins

_print = builtins.print

def my_print(*args, **kwargs):
    _print('In my_print: ', end='')
    return _print(*args, **kwargs)

builtins.print = my_print

import my_module  # -> In my_print: hello
I first needed to better understand the import operation. Fortunately, this is well explained in the importlib documentation and scratching through the source code helped too.
The import process is actually split into two parts. First, a finder is in charge of resolving the module name (including dot-separated packages) and instantiating an appropriate loader; built-ins, for example, are not imported the way local modules are. Then the loader is called, based on what the finder returned. The loader gets the source from a file or from a cache, and executes the code if the module was not previously loaded.
This is very simple, and it explains why I actually did not need to use the abstract classes from importlib.abc: I do not want to provide my own import process. Instead, I could create a subclass of one of the classes from importlib.machinery and override get_source() from SourceFileLoader, for example. However, this is not the way to go, because the loader is instantiated by the finder, so I have no control over which class is used; I cannot specify that my subclass should be picked.
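This is easy to see interactively: the spec returned by the finder already carries a fully instantiated loader (a quick illustration; the exact loader class depends on the module):

import importlib.util

spec = importlib.util.find_spec("my_module")
print(spec.loader)  # e.g. a SourceFileLoader instance, already chosen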
So, the best solution is to let the finder do its job, and then replace the get_source() method of whatever Loader has been instantiated.
Unfortunately, by looking through the source code I saw that the basic loaders do not use get_source() (which is only used by the inspect module), so my whole idea could not work.
In the end, I guess get_source() should be called manually, then the returned source should be modified, and finally the code should be executed. This is what Martin Valgur detailed in his answer.
If compatibility with Python 2 is needed, I see no other way than reading the source file:
import imp
import sys
import types

module_name = "my_module"
file, pathname, description = imp.find_module(module_name)
with open(pathname) as f:
    source = f.read()
source = source.replace('hello', 'world')
module = types.ModuleType(module_name)
exec(source, module.__dict__)
sys.modules[module_name] = module
If importing the module before patching is okay, then a possible solution would be
import inspect
import my_module
source = inspect.getsource(my_module)
new_source = source.replace('"hello"', '"world"')
exec(new_source, my_module.__dict__)
If you're after a more general solution, then you can also take a look at the approach I used in another answer a while ago.
My solution updates the source file, which works for the inner-import situation, i.e. where the source file itself does from transformers.models.albert import modeling_albert. In such a case, even the solution from Martin Valgur won't work, so I update the source file on disk. I hope it helps people who have the same trouble as me.
import inspect
from transformers.models.albert import modeling_albert
# Get source
source = inspect.getsource(modeling_albert)
source_before = "AlbertModel(config, add_pooling_layer=False)"
source_after = "AlbertModel(config, add_pooling_layer=True)"
new_source = source.replace(source_before, source_after)
# Update file
file_path = modeling_albert.__spec__.origin
with open(file_path, 'w') as f:
f.write(new_source)
Not elegant, but it works for me (you may have to add a path):
with open('my_module.py') as aFile:
    exec(aFile.read().replace(<something>, <something else>))

Load modules conditionally Python

I wrote a main Python module that needs to load a file parser to work. Initially there was only one text parser module, but I need to add more parsers for different cases:
parser_class1.py
parser_class2.py
parser_class3.py
Only one is required for each running instance, so I'm thinking of loading it from the command line:
mmain.py -p parser_class1
With this purpose in mind, I wrote this code to select the parser to load when the main module is called:
#!/usr/bin/env python
import argparse

aparser = argparse.ArgumentParser()
aparser.add_argument('-p',
                     action='store',
                     dest='module',
                     help='-p module to import')
results = aparser.parse_args()

if not results.module:
    aparser.error('Error! no module')

try:
    exec("import %s" % (results.module))
    print '%s imported done!' % (results.module)
except ImportError, e:
    print e
But I have read that this way is dangerous and maybe not standard.
So, is this approach OK, or must I find another way to do it? Why?
Thanks, any comments are welcome.
You could actually just execute the import statement inside a conditional block:
if x:
    import module1a as module1
else:
    import module1b as module1
You can account for various whitelisted module imports in different ways using this, but effectively the idea is to pre-program the imports, and then essentially use a GOTO to make the proper imports... If you do want to just let the user import any arbitrary argument, then the __import__ function would be the way to go, rather than eval.
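For reference, a minimal sketch of that dynamic route; importlib.import_module is generally preferred over calling __import__ directly (the module name is the example from the question):

import importlib

module_name = 'parser_class1'  # e.g. taken from the -p argument
parser = importlib.import_module(module_name)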
Update:
As #thedox mentioned in the comment, the as module1 section is the idiomatic way for loading similar APIs with different underlying code.
In the case where you intend to do completely different things with entirely different APIs, that's not the pattern to follow.
A more reasonable pattern in this case would be to include the code related to a particular import with that import statement:
if ...:
    import module1
    # do some stuff with module1 ...
else:
    import module2
    # do some stuff with module2 ...
As for security, if you allow the user to cause an import of some arbitrary code-set (e.g. their own module, perhaps?), it's not much different than using eval on user-input. It's essentially the same vulnerability: the user can get your program to execute their own code.
I don't think there's a truly safe way to let the user import arbitrary modules, at all. The exception is when they have no access to the file system and therefore cannot create new code to be imported, in which case you're basically back to the whitelist case, and may as well implement an explicit whitelist to prevent future vulnerabilities if/when the user does gain file-system access at some point.
Here is how to use __import__():
# note: whitelist entries are module names, not file names
allowed_modules = ['os', 're', 'your_module', 'parser_class1', 'parser_class2']

if not results.module:
    aparser.error('Error! no module')
try:
    if results.module in allowed_modules:
        module = __import__(results.module)
        print '%s imported as "module"' % (results.module)
    else:
        print 'hey what are you trying to do?'
except ImportError, e:
    print e

module.your_function(your_data)
EVAL vs __IMPORT__()
Using eval allows the user to run any code on your computer. Don't do that. __import__() only allows the user to load modules, apparently without letting the user run arbitrary code. But it is only apparently safer.
The proposed function, without allowed_modules, is still risky, since it can allow loading an arbitrary module that may run malicious code on import. Potentially, an attacker can place a file somewhere (a shared folder, an ftp folder, an upload folder managed by your web server...) and get it loaded through your argument.
WHITELISTS
Using allowed_modules mitigates the problem but does not solve it completely: to harden things further, you still have to check whether an attacker has written an "os.py", "re.py", "your_module.py" or "parser_class1.py" into your script folder, since Python searches for modules there first (docs).
Eventually you may compare the parser_class*.py code against a list of hashes, like sha1sum does.
FINAL REMARKS: In the end, if the user has write access to your script folder, you cannot guarantee absolutely safe code.
You should think of all of the possible modules you may import for that parsing function and then use a case statement or dictionary to load the correct one. For example:
import parser_class1, parser_class2, parser_class3

parser_map = {
    'class1': parser_class1,
    'class2': parser_class2,
    'class3': parser_class3,
}

if not args.module:
    # report error
    parser = None
else:
    parser = parser_map[args.module]
# perform work with parser
If loading any of the parser_classN modules in this example is expensive, you can define lambdas or functions that return that module (i.e. def get_class1(): import parser_class1; return parser_class1) and alter the line to be parser = parser_map[args.module]()
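A quick sketch of that lazy variant (function names follow the pattern suggested above):

def get_class1():
    import parser_class1
    return parser_class1

def get_class2():
    import parser_class2
    return parser_class2

parser_map = {
    'class1': get_class1,
    'class2': get_class2,
}

# the chosen module is imported only when actually requested
parser = parser_map[args.module]()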
The exec option could be very dangerous, because you're executing unvalidated user input. Imagine your user did something like:
mmain.py -p "parser_class1; some_function_or_code_that_is_malicious()"

How do I override a Python import?

I'm working on pypreprocessor, which is a preprocessor that takes c-style directives; I've been able to make it work like a traditional preprocessor (it's self-consuming and executes postprocessed code on-the-fly), except that it breaks library imports.
The problem is: the preprocessor runs through the file, processes it, outputs to a temporary file, and exec()s the temporary file. Libraries that are imported need to be handled a little differently, because they aren't executed; rather, they are loaded and made accessible to the caller module.
What I need to be able to do is: interrupt the import (since the preprocessor is being run in the middle of the import), load the postprocessed code as a tempModule, and replace the original import with the tempModule to trick the calling script into believing that the tempModule is the original module.
I have searched everywhere and so far have no solution.
This Stack Overflow question is the closest I've seen so far to providing an answer:
Override namespace in Python
Here's what I have.
# Remove the bytecode file created by the first import
os.remove(moduleName + '.pyc')
# Remove the first import
del sys.modules[moduleName]
# Import the postprocessed module
tmpModule = __import__(tmpModuleName)
# Set first module's reference to point to the preprocessed module
sys.modules[moduleName] = tmpModule
moduleName is the name of the original module, and tmpModuleName is the name of the postprocessed code file.
The strange part is that this solution still runs completely normally, as if the first module had loaded normally; but if you remove the last line, you get a module-not-found error.
Hopefully someone on Stack Overflow knows a lot more about imports than I do, because this one has me stumped.
Note: I will only award a solution, or, if this is not possible in Python, the best and most detailed explanation of why it is not possible.
Update: For anybody who is interested, here is the working code.
if imp.lock_held() is True:
    del sys.modules[moduleName]
    sys.modules[tmpModuleName] = __import__(tmpModuleName)
    sys.modules[moduleName] = __import__(tmpModuleName)
The 'imp.lock_held' part detects whether the module is being loaded as a library. The following lines do the rest.
Does this answer your question? The second import does the trick.
Mod_1.py
def test_function():
    print "Test Function -- Mod 1"
Mod_2.py
def test_function():
    print "Test Function -- Mod 2"
Test.py
#!/usr/bin/python
import sys
import Mod_1
Mod_1.test_function()
del sys.modules['Mod_1']
sys.modules['Mod_1'] = __import__('Mod_2')
import Mod_1
Mod_1.test_function()
To define a different import behavior or to totally subvert the import process you will need to write import hooks. See PEP 302.
For example,
import sys

class MyImporter(object):
    def find_module(self, module_name, package_path):
        # Return a loader
        return self

    def load_module(self, module_name):
        # Return a module
        return self

sys.meta_path.append(MyImporter())

import now_you_can_import_any_name
print now_you_can_import_any_name
It outputs:
<__main__.MyImporter object at 0x009F85F0>
So basically it returns a new module (which can be any object), in this case itself. You may use it to alter the import behavior by returning processed_xxx on import of xxx.
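For instance, a sketch of that idea (the names xxx and processed_xxx are hypothetical; it assumes a module named processed_xxx is importable):

import sys

class PreprocessedImporter(object):
    def find_module(self, module_name, package_path):
        # only intercept the module we preprocessed
        return self if module_name == 'xxx' else None

    def load_module(self, module_name):
        module = __import__('processed_' + module_name)
        sys.modules[module_name] = module  # alias the original name
        return module

sys.meta_path.insert(0, PreprocessedImporter())
import xxx  # actually loads processed_xxx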
IMO: Python doesn't need a preprocessor. Whatever you are accomplishing can be accomplished in Python itself due to its very dynamic nature. For example, taking the debug case: what is wrong with having, at the top of the file,
debug = 1
and later
if debug:
    print "wow"
?
In Python 2 there is the imputil module that seems to provide the functionality you are looking for, but it has been removed in Python 3. It's not very well documented, but it contains an example section that shows how you can replace the standard import functions.
For Python 3 there is the importlib module (introduced in Python 3.1), which contains functions and classes to modify the import functionality in all kinds of ways. It should be suitable for hooking your preprocessor into the import system.
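A rough Python 3 sketch of such a hook (the module name and the preprocessing step are hypothetical placeholders; a real preprocessor would transform the original source instead):

import importlib.abc
import importlib.util
import sys

class PreprocessorFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Intercepts one module and executes a preprocessed version of it."""

    def find_spec(self, fullname, path, target=None):
        if fullname != 'my_module':  # hypothetical target module
            return None
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        source = 'print("hello")'  # stand-in for the real source
        source = source.replace('hello', 'world')  # the "preprocessing"
        exec(compile(source, '<preprocessed>', 'exec'), module.__dict__)

sys.meta_path.insert(0, PreprocessorFinder())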

Is this correct way to import python scripts residing in arbitrary folders?

This snippet is from an earlier answer here on SO. It is about a year old (and the answer was not accepted). I am new to Python and I am finding the system path a real pain. I have a few functions written in scripts in different directories, and I would like to be able to import them into new projects without having to jump through hoops.
This is the snippet:
import os
import sys

def import_path(fullpath):
    """ Import a file with full path specification. Allows one to
        import from anywhere, something __import__ does not do.
    """
    path, filename = os.path.split(fullpath)
    filename, ext = os.path.splitext(filename)
    sys.path.append(path)
    module = __import__(filename)
    reload(module)  # might be out of date
    del sys.path[-1]
    return module
It's from here:
How to do relative imports in Python?
I would like some feedback as to whether I can use it or not - and if there are any undesirable side effects that may not be obvious to a newbie.
I intend to use it something like this:
import_path('/home/pydev/path1/script1.py')
script1.func1()
etc
Is it 'safe' to use the function in the way I intend to?
The "official" and fully safe approach is the imp module of the standard Python library.
Use imp.find_module to find the module on your precisely specified list of acceptable directories; it returns a 3-tuple (file, pathname, description). If unsuccessful, file is actually None (but it can also raise ImportError, so you should use a try/except for that as well as checking if file is None).
If the search is successful, call imp.load_module (in a try/finally to make sure you close the file!) with the above three arguments after the first one, which must be the same name you passed to find_module; it returns the module object (phew ;-).
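Put together, the described pattern could look like this (a sketch of the imp-based recipe above; imp is the Python 2-era API this answer refers to):

import imp

def find_and_load(name, acceptable_dirs):
    try:
        file, pathname, description = imp.find_module(name, acceptable_dirs)
    except ImportError:
        return None
    try:
        return imp.load_module(name, file, pathname, description)
    finally:
        if file is not None:  # packages are returned with file=None
            file.close()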
As mentioned, please consider thread safety, if appropriate. I prefer something closer to a solution posted in a similar post. The main differences below: the use of insert to specify the priority of the import, correct restoration of sys.path using try...finally, and setting the module into the global namespace.
# inspired by Alex Martelli's solution to
# http://stackoverflow.com/questions/1096216/override-namespace-in-python/1096247#1096247
def import_from_absolute_path(fullpath, global_name=None):
    """Dynamic script import using full path."""
    import os
    import sys

    script_dir, filename = os.path.split(fullpath)
    script, ext = os.path.splitext(filename)

    sys.path.insert(0, script_dir)
    try:
        module = __import__(script)
        if global_name is None:
            global_name = script
        globals()[global_name] = module
        sys.modules[global_name] = module
    finally:
        del sys.path[0]
It does feel like a bit of a hack, but at the moment, I can't think of any unintended side effects that are likely to occur, at least not as long as you're just using this for your own scripts. Basically what it does is temporarily add the parent directory of the specified file (in your example, /home/pydev/path1/) to the list of paths that Python checks when it's looking for a module to import.
The only risk I can think of right now would arise in a multithreaded environment, where two or more threads (or processes) are running this function simultaneously. If thread A wants to import module A from path dirA/A.py, and thread B wants to import module B from path dirB/B.py, you'd wind up with both dirA and dirB in sys.path for a short time. And if there is a file named B.py in dirA, it's possible that thread B will find that (dirA/B.py) instead of the file it's looking for (dirB/B.py), thus importing the wrong module. For this reason, I wouldn't use it in production code, or code that you're going to distribute to other people (at least not without warning them that this hack is in here!). In a situation like that, you could write a more complex function that allows you to specify the file to import without messing with the standard set of paths. (That's what mod_python does, for example)
I would be worried that your script name might correspond with a module that shows up earlier in the path. To dispel this fear, I would fully replace the path with a new list containing just the directory containing the module, then put it back once the import has completed. Also, you should wrap this in some sort of lock so that multiple threads trying to do the same thing don't interfere with each other.
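A literal sketch of that suggestion (note the trade-off: while sys.path is swapped out, any import done inside the loaded script that is not already in sys.modules will fail):

import os
import sys
import threading

_import_lock = threading.Lock()

def import_path_exclusive(fullpath):
    """Import fullpath with sys.path replaced by just its directory."""
    path, filename = os.path.split(fullpath)
    name, ext = os.path.splitext(filename)
    with _import_lock:        # keep threads from seeing the swap
        saved = sys.path
        sys.path = [path]     # only this directory is searched
        try:
            return __import__(name)
        finally:
            sys.path = saved  # always restore the original path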
