How to tell whether a Python function with dependencies has changed? - python

tl;dr:
How can I cache the results of a Python function to disk and in a later session use the cached value if and only if the function code and all of its dependencies are unchanged since I last ran it?
In other words, I want to make a Python caching system that automatically watches out for changed code.
Background
I am trying to build a tool for automatic memoization of computational results from Python. I want the memoization to persist between Python sessions (i.e. be reusable at a later time in another Python instance, preferably even on another machine with the same Python version).
Assume I have a Python module mymodule with some function mymodule.func(). Let's say I already solved the problem of serializing/identifying the function arguments, so we can assume that mymodule.func() takes no arguments if it simplifies anything.
Also assume that I guarantee that the function mymodule.func() and all its dependencies are deterministic, so mymodule.func() == mymodule.func().
The task
I want to run the function mymodule.func() today and save its results (and any other information necessary to solve this task). When I want the same result at a later time, I would like to load the cached result instead of running mymodule.func() again, but only if the code in mymodule.func() and its dependencies are unchanged.
To simplify things, we can assume that the function is always run in a freshly started Python interpreter with a minimal script like this:
import some_save_function
import mymodule
result = mymodule.func()
some_save_function(result, 'filename')
Also, note that I don't want to be overly conservative. It is probably not too hard to use the modulefinder module to find all modules involved when running the first time, and then not use the cache if any module has changed at all. But this defeats my purpose, because in my use case it is very likely that some unrelated function in an imported module has changed.
Previous work and tools I have looked at
joblib memoizes results tied to the function name, and also saves the source code so we can check if it is unchanged. However, as far as I understand it does not check upstream functions (called by mymodule.func()).
The ast module gives me the Abstract Syntax Tree of any Python code, so I guess I can (in principle) figure it all out that way. How hard would this be? I am not very familiar with the AST.
Can I use any of all the black magic that's going on inside dill?
More trivia than a solution: IncPy, a finished/deceased research project, implemented a Python interpreter doing this by default, always. Nice idea, but never made it outside the lab.
Grateful for any input!
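One possible starting point, as a rough sketch rather than a finished answer: hash the compiled bytecode of the function and then follow every name it mentions that resolves to another plain Python function in the defining module's globals. The helper names `_feed` and `fingerprint` are made up for this illustration. It assumes the same Python version in both sessions (bytecode differs between versions), and it does not follow calls made through attribute access such as `othermodule.helper()`, methods, or anything resolved dynamically.

```python
import hashlib
import types


def _feed(code, digest, names):
    """Add one code object to the digest and collect every name it mentions."""
    digest.update(code.co_code)
    names.update(code.co_names)
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            # Nested functions, lambdas and comprehensions live in co_consts.
            _feed(const, digest, names)
        elif isinstance(const, frozenset):
            # Set literals: sort so the repr does not depend on hash order.
            digest.update(repr(sorted(const, key=repr)).encode())
        else:
            digest.update(repr(const).encode())


def fingerprint(func):
    """Hash func together with every plain-Python function it reaches by name."""
    digest = hashlib.sha256()
    seen = set()
    pending = [func]
    while pending:
        current = pending.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        digest.update(current.__qualname__.encode())
        names = set()
        _feed(current.__code__, digest, names)
        # Follow only names that resolve to Python functions in the defining
        # module's globals; builtins, classes and C extensions are skipped.
        for name in sorted(names, reverse=True):
            target = current.__globals__.get(name)
            if isinstance(target, types.FunctionType):
                pending.append(target)
    return digest.hexdigest()


# Demo: a throwaway namespace stands in for mymodule.
ns = {}
exec(
    "def helper():\n    return 1\n"
    "def func():\n    return helper() + 1\n"
    "def unrelated():\n    return 'a'\n",
    ns,
)
before = fingerprint(ns["func"])

exec("def unrelated():\n    return 'b'\n", ns)  # not a dependency of func
after_unrelated_change = fingerprint(ns["func"])

exec("def helper():\n    return 2\n", ns)  # upstream dependency of func
after_helper_change = fingerprint(ns["func"])
```

Storing the fingerprint next to the cached result and comparing it on the next run would invalidate the cache when `helper` changes but not when `unrelated` does, which is the non-conservative behaviour asked for. Because it errs on the side of missing dependencies rather than finding too many, a stale cache remains possible for code the sketch cannot see.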

Related

Logging progress of `enumerate` function in Python

I am working on optimizing a Python (version 2.7; constrained due to old modules I need to use) script that is running on AWS and trying to identify exactly how many resources I need in the environment I am building. In order to do so, I am logging quite a bit of information to the console when the script runs and benchmarking different resource configurations.
One of the bottlenecks is a list with 32,000 items that runs through the Python enumerate function. This takes quite a while to run and my script is blind to its progress until it finishes. I assume enumerate is looping through the items in some fashion so thought I could inject logging or printing within that loop. I know this is 'hacky' but it will work fine for me for now since it is temporary while I run tests.
I can't find where the function runs from (I did find an Enumerate class in the numba module. I tried printing in there and it did not work). I know it is part of the __builtin__ module and am having trouble tracking that down as well and tried several techniques to find its exact location such as printing __builtin__.__file__ but to no avail.
My question is, a) can you all help me identify where the function lives and if this is a good approach? or b) if there is a better approach for this.
Thanks for your help!
I suggest using the tqdm module.
The tqdm module displays the progress of an iteration. It works with any iterable, whether or not its length is known.
By extending the tqdm class you can log the required information after any given number of iterations. You can find more information in the tqdm class docstring.
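If installing tqdm is not an option, the same idea (report progress every so many iterations) can be sketched with a plain generator wrapped around enumerate. This is not tqdm's API; `logged_enumerate` and its `every` parameter are hypothetical names chosen for this illustration. There is no need to find or patch the built-in enumerate itself.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("progress")


def logged_enumerate(items, every=1000, start=0):
    """Behave like enumerate(), but log a progress line every `every` items."""
    # The total is only known for sized containers, not for generators.
    total = len(items) if hasattr(items, "__len__") else "?"
    for index, item in enumerate(items, start):
        done = index - start
        if done % every == 0:
            logger.info("processed %d of %s items", done, total)
        yield index, item


# Drop-in replacement for `for i, x in enumerate(data):`
pairs = list(logged_enumerate(["a", "b", "c"], every=2))
```

The loop body stays unchanged; only the call to enumerate is swapped for the wrapper.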

Python : counting module imports?

I am a mid-level Python developer at an animation studio, and have been presented with a unique diagnostics request:
To assess what code gets used and what doesn't.
Within the sprawling, disorganized structure of Python modules importing modules:
I need to count the Python modules that are imported, and possibly at a deeper level, find which methods are called.
As for finding out which methods are called, I think that would be easy to solve by writing my own logging metaclass.
However, I'm at a loss to imagine how I should count or log module imports at varying depths.
Thanks for any ideas you may have.
If you have a way to exercise the code, you can run the code under coverage.py. It's normally used for testing, but its basic function would work here: it indicates what lines of code are run and what are not.

Mocking Python iterables for use with Sphinx

I'm using Sphinx to document a project that depends on wxPython, using the autodocs extension so that it will automatically generate pages from our docstrings. The autodocs extension automatically operates on every module you import, which is fine for our packages but is a problem when we import a large external library like wxPython. Thus, instead of letting it generate everything from wxPython I'm using the unittest.mock library module (previously the external package Mock). The most basic setup works fine for most parts of wxPython, but I've run into a situation I can't see an easy way around (likely because of my relative unfamiliarity with mock until this week).
Currently, the end of my conf.py file has the following:
MOCK_MODULES = ['wx.lib.newevent'] # I've skipped irrelevant entries...
for module_name in MOCK_MODULES:
    sys.modules[module_name] = mock.Mock()
For all the wxPython modules but wx.lib.newevent, this works perfectly. However, here I'm using the newevent.NewCommandEvent() function[1] to create an event for a particular scenario. In this case, I get a warning on the NewCommandEvent() call with the note TypeError: 'Mock' object is not iterable.
While I can see how one would use patching to handle this for building out unit tests (which I will be doing in the next month!), I'm having a hard time seeing how to integrate that at a simple level in my Sphinx configuration.
Edit: I've just tried using MagicMock() as well; this still produces an error at the same point, though it now produces ValueError: need more than 0 values to unpack. That seems like a step in the right direction, but I'm still not sure how to handle this short of explicitly setting it up for this one module. Maybe that's the best solution, though?
Footnotes
Yes, that's a function, naming convention making it look like a class notwithstanding; wxPython follows the C++ naming conventions which are used throughout the wxWidgets toolkit.
From the error, it looks like it is actually executing newevent.NewCommandEvent(), so I assume that somewhere in your code you have a top-level line something like this:
import wx.lib.newevent
...
event, binder = wx.lib.newevent.NewCommandEvent()
When autodoc imports the module, it tries to run this line of code, but since NewCommandEvent is actually a Mock object, Python can't unpack its return value into the (event, binder) tuple. There are two possible solutions. The first is to change your code so that this line is not executed on import, maybe by wrapping it inside if __name__ == '__main__'. I would recommend this solution because creating objects like this on import can often have problematic side effects.
The second solution is to tell the Mock object to return appropriate values thus:
wx.lib.newevent.NewCommandEvent = mock.Mock(return_value=(mock.Mock(), mock.Mock()))
However, if you are doing anything in your code with the returned values you might run into the same problem further down the line.
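To make the second solution concrete, here is a small self-contained sketch of the difference between a bare Mock and one whose return value is a two-tuple. It uses only unittest.mock; the name `fake_newevent` is a stand-in invented for this example and nothing here touches the real wxPython.

```python
from unittest import mock

# A bare Mock returns another Mock when called, and that cannot be unpacked.
bare = mock.Mock()
try:
    event, binder = bare.NewCommandEvent()
    unpack_failed = False
except TypeError:
    unpack_failed = True

# Configuring return_value as a 2-tuple makes the unpacking succeed.
fake_newevent = mock.Mock()
fake_newevent.NewCommandEvent.return_value = (mock.Mock(), mock.Mock())
event, binder = fake_newevent.NewCommandEvent()

# In conf.py the configured object would then be registered in place of
# the real module, for example:
#     sys.modules['wx.lib.newevent'] = fake_newevent
```

Registering the configured object in sys.modules avoids having to import wx in conf.py at all.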

Is there a way to automatically verify all imports in a Python script without running it?

Suppose that I have a somewhat long Python script (too long to hand-audit) that contains an expensive operation, followed by a bunch of library function calls that are dependent on the output of the expensive operation.
If I have not imported all the necessary modules for the library function calls, then Python will error out only after the expensive operation has finished, because Python interprets line by line.
Is there a way to automatically verify that I have all the necessary imports without either a) manually verifying it line by line or b) running through the expensive operation each time I miss a library?
Another way to put that question is whether there is a tool that will do what the C compiler does with respect to verifying dependencies before run time.
No, this is not possible, because dependencies can be injected at runtime.
Consider:
def foo(break_things):
    if not break_things:
        globals()['bar'] = lambda: None

long_result = ...
foo(long_result > 0)
bar()
Depending on the runtime value of long_result, this may raise NameError: name 'bar' is not defined.
There is a module called snakefood:
Generate dependency graphs from Python code
It uses the AST to parse the Python files. This is very reliable, it always runs. No module is loaded. Loading modules to figure out dependencies is almost always problem, because a lot of codebases run initialization code in the global namespace, which often requires additional setup. Snakefood is guaranteed not to have this problem (it just runs, no matter what).
You can get a list of imports by calling sfood-imports <script.py>. Then you can import each module in the list one by one and see if it throws ImportError.
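A similar check can be sketched with the standard library alone, swapping snakefood for the ast module and swapping the trial import for importlib.util.find_spec, which looks a module up without executing it. The function name `missing_imports` is hypothetical. Only the top-level package of each import is looked up, so a missing submodule inside an installed package goes unnoticed, and imports made dynamically at run time are invisible to it.

```python
import ast
import importlib.util


def missing_imports(source):
    """Return the statically imported modules that cannot be located."""
    tree = ast.parse(source)  # parses only; the script is never executed
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module)  # relative imports (level > 0) are skipped

    missing = []
    for name in sorted(names):
        top_level = name.split(".")[0]
        try:
            found = importlib.util.find_spec(top_level) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


script = (
    "import os\n"
    "import json.decoder\n"
    "from no_such_pkg_xyz import thing\n"
    "result = expensive_operation()\n"
)
problems = missing_imports(script)
```

This only verifies that the import statements present in the file can be resolved. Detecting a name that is used without ever being imported is the job of a linter such as pylint.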
Or, just use pylint. Quote from docs:
Error detection
checking if declared interfaces are truly implemented
checking if modules are imported
Hope that helps.

how do you statically find dynamically loaded modules

How does one find the location of dynamically imported modules from a Python script?
From my understanding, Python can load modules dynamically (at run time),
whether by using __import__(module_name), by using exec "from x import y", or by using imp.find_module("module_name") and then imp.load_module(param1, param2, param3, param4).
With that in mind, I want to get all the dependencies of a Python file. This would include (or at least I tried to include) the dynamically loaded modules, whether they are named by hard-coded string objects or by strings returned from a function/method.
For a normal import module_name and from x import y, you can either scan the code manually or use modulefinder.
So if I want to copy one Python script and all its dependencies (including the custom dynamically loaded modules), how should I do that?
You can't; the very nature of programming (in any language) means that you cannot predict what code will be executed without actually executing it. So you have no way of telling which modules could be included.
This is further complicated by user input; consider: __import__(sys.argv[1]).
There's a lot of theoretical material about the first problem, which is normally described as the Halting problem. The second obviously can't be done.
From a theoretical perspective, you can never know exactly what/where modules are being imported. From a practical perspective, if you simply want to know where the modules are, check the module.__file__ attribute or run the script under python -v to find files when modules are loaded. This won't give you every module that could possibly be loaded, but will get most modules with mostly sane code.
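As a small illustration of the practical route, the snippet below collects the `__file__` attribute of everything currently in sys.modules. The function name `loaded_module_files` is made up for this example. Run at the end of a script, or after exercising the application, it lists the files for whatever was actually loaded, including dynamically imported modules, but not modules that some other code path might have loaded.

```python
import json  # any already-imported module works; json is only an example
import sys


def loaded_module_files():
    """Map each loaded module name to the file it was loaded from."""
    files = {}
    # Copy the items first: sys.modules can change while we iterate.
    for name, module in list(sys.modules.items()):
        path = getattr(module, "__file__", None)
        if path:  # built-in modules such as sys have no __file__
            files[name] = path
    return files


locations = loaded_module_files()
json_location = locations.get("json")
```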
See also: How do I find the location of Python module sources?
This is not possible to do 100% accurately. I answered a similar question here: Dependency Testing with Python
Just an idea and I'm not sure that it will work:
You could write a module that contains a wrapper for __builtin__.__import__. This wrapper would save a reference to the old __import__ and then assign a function to __builtin__.__import__ that does the following:
whenever called, get the current stack trace and work out the calling function. Maybe the information in the globals parameter to __import__ is enough.
get the module of that calling function and store the name of this module and what will get imported
redirect the call to the real __import__
After you have done this you can call your application with python -m magic_module yourapp.py. The magic module must store the information somewhere where you can retrieve it later.
That's quite a question.
Static analysis would have to predict all possible run-time execution paths and establish whether the program halts at all for a given input.
That is equivalent to the Halting Problem, and unfortunately there is no general solution.
The only way to resolve dynamic dependencies is to run the code.
