Logging progress of `enumerate` function in Python

I am working on optimizing a Python (version 2.7; constrained due to old modules I need to use) script that is running on AWS and trying to identify exactly how many resources I need in the environment I am building. In order to do so, I am logging quite a bit of information to the console when the script runs and benchmarking different resource configurations.
One of the bottlenecks is a list with 32,000 items that runs through the Python enumerate function. This takes quite a while to run and my script is blind to its progress until it finishes. I assume enumerate is looping through the items in some fashion so thought I could inject logging or printing within that loop. I know this is 'hacky' but it will work fine for me for now since it is temporary while I run tests.
I can't find where the function is implemented (I did find an Enumerate class in the numba module; I tried printing in there and it did not work). I know it is part of the __builtin__ module, but I'm having trouble tracking that down as well; I've tried several techniques to find its exact location, such as printing __builtin__.__file__, but to no avail.
My question is: a) can you help me identify where the function lives, and is this a good approach? Or b) is there a better approach?
Thanks for your help!

I suggest you use the tqdm module.
tqdm visualizes the progress of an iteration as a progress bar. It works with any iterable, whether or not its length is known.
By subclassing tqdm you can log whatever information you need every however-many iterations; the tqdm class docstring has the details. (As an aside: enumerate is a built-in implemented in C, which is why you can't find Python source for it to edit.)
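With tqdm the whole loop is essentially `for i, item in enumerate(tqdm(items)):`. If adding a dependency is not an option, here is a rough, dependency-free sketch of the same idea (the function name and logging interval are illustrative), which runs unchanged on Python 2.7 and 3:

```python
import sys

def logged(iterable, every=1000, total=None):
    """Yield items from `iterable`, printing a progress line every `every` items."""
    for i, item in enumerate(iterable):
        if i % every == 0:
            if total is not None:
                sys.stderr.write('progress: %d/%d\n' % (i, total))
            else:
                sys.stderr.write('progress: %d\n' % i)
        yield item

items = list(range(32000))  # stand-in for the real 32,000-item list
for index, item in enumerate(logged(items, every=8000, total=len(items))):
    pass  # real per-item work goes here
```

Because `logged` is itself a generator, wrapping it in enumerate costs essentially nothing and the original loop body needs no changes.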

Related

Most Performant Way To Do Imports

From a performance point of view (time or memory) is it better to do:
import pandas as pd
or
from pandas import DataFrame, TimeSeries
Does the best choice depend on how many names I'm importing from the package?
Similarly, I've seen people do things like:
def foo(bar):
    from numpy import array
Why would I ever want to do an import inside a function or method definition? Wouldn't this mean that import is being performed every time that the function is called? Or is this just to avoid namespace collisions?
This is micro-optimising, and you should not worry about this.
Modules are loaded once per Python process. All code that then imports only need to bind a name to the module or objects defined in the module. That binding is extremely cheap.
Moreover, the top-level code in your module only runs once too, so the binding takes place just once. An import in a function does the binding each time the function is run, but again, this is so cheap as to be negligible.
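To see this concretely, a quick measurement sketch using the standard timeit module (the exact numbers will vary by machine, but both operations are typically well under a microsecond per call):

```python
import sys
import timeit

# math is loaded once per process; a later `import math` statement only
# re-binds a name to the module object already cached in sys.modules.
import math
assert sys.modules['math'] is math

# Time a repeated import (sys.modules cache hit) vs. a plain name binding.
t_reimport = timeit.timeit('import math', number=100000)
t_rebind = timeit.timeit('m = math', setup='import math', number=100000)
print('re-import: %.4fs  re-bind: %.4fs (100,000 iterations each)' % (t_reimport, t_rebind))
```

Both totals are tiny even over 100,000 iterations, which is why this is micro-optimisation territory.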
Importing in a function makes a difference for two reasons: it won't put that name in the global namespace for the module (so no namespace pollution), and because the name is now local, using that name is slightly faster than using a global.
If you want to improve performance, focus on code that is being repeated many, many times. Importing is not it.
Answering the more general question of when to import: imports are dependencies. They are code that may or may not be present, and that is required for the program to function. It is therefore a very good idea to import as early as possible, so that a missing dependency fails immediately rather than in the middle of execution.
This is particularly true as PyPy becomes more popular, where a module might exist but not be usable under PyPy. Far better to fail early than potentially hours into a run.
As for "import pandas as pd" vs "from pandas import DataFrame, TimeSeries", this question has multiple concerns (as all questions do), some far more important than others. There's the question of namespace, the question of readability, and the question of performance. Performance, as Martijn states, should contribute about 0.0001% of the decision. Readability should contribute about 90%, and namespace only 10%, as it can be mitigated so easily.
Personally, in my opinion, both import X as Y and from X import Y are bad practice, because explicit is better than implicit. You don't want to be on line 2000 trying to remember which package calculate_mean comes from, because it isn't referenced anywhere else in the code. When I first started using numpy I was copy/pasting code from the internet and couldn't figure out why I couldn't pip install np. This obviously isn't a problem if you already know that "np" is Python for "numpy", but it's a stupid and pointless confusion for the three letters it saves. It came from numpy. Use numpy.
There is an advantage of importing a module inside a function that hasn't been mentioned yet: doing so gives you some control over when the module is loaded. In fact, even though J.J.'s answer recommends importing all modules as early as possible, this control allows you to postpone loading a module.
Why would you want to do that? Well, while it doesn't improve the actual performance of your program, doing so can improve the perceived performance, and by virtue of this, the user experience:
In part, users perceive whether your app is fast or slow based on how long it takes to start up.
MSDN: Best practices for your app's startup performance
Loading every module at the beginning of your main script can take some time. For example, one of my apps uses the Qt framework, Pandas, Numpy, and Matplotlib. If all these modules are imported right at the beginning of the app, the appearance of the user interface is delayed by several seconds. Users don't like to wait, and they are likely to perceive your app as generally slow because of this wait.
But if for example Matplotlib is imported only from within those functions that are called whenever the user issues a plot command, the startup time is notably reduced. The user doesn't perceive your app to be that sluggish anymore, which may result in a better user experience.
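A minimal sketch of this deferral pattern (the function name is illustrative, and `json` stands in for a genuinely heavy package such as matplotlib):

```python
import sys

def render_report(values):
    # Deferred import: the module is loaded the first time this function
    # runs, not at program startup. Subsequent calls find it already
    # cached in sys.modules, so only the first call pays the load cost.
    import json
    return json.dumps({'values': values})

print(render_report([1, 2, 3]))
assert 'json' in sys.modules  # loaded now, and cached for later calls
```

The startup saving is the import cost of the heavy package; the per-call cost after the first call is just a cheap name binding.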

How to tell whether a Python function with dependencies has changed?

tl;dr:
How can I cache the results of a Python function to disk and in a later session use the cached value if and only if the function code and all of its dependencies are unchanged since I last ran it?
In other words, I want to make a Python caching system that automatically watches out for changed code.
Background
I am trying to build a tool for automatic memoization of computational results from Python. I want the memoization to persist between Python sessions (i.e. be reusable at a later time in another Python instance, preferably even on another machine with the same Python version).
Assume I have a Python module mymodule with some function mymodule.func(). Let's say I already solved the problem of serializing/identifying the function arguments, so we can assume that mymodule.func() takes no arguments if it simplifies anything.
Also assume that I guarantee that the function mymodule.func() and all its dependencies are deterministic, so mymodule.func() == mymodule.func().
The task
I want to run the function mymodule.func() today and save its results (and any other information necessary to solve this task). When I want the same result at a later time, I would like to load the cached result instead of running mymodule.func() again, but only if the code in mymodule.func() and its dependencies are unchanged.
To simplify things, we can assume that the function is always run in a freshly started Python interpreter with a minimal script like this:
from some_save_module import some_save_function
import mymodule

result = mymodule.func()
some_save_function(result, 'filename')
Also, note that I don't want to be overly conservative. It is probably not too hard to use the modulefinder module to find all modules involved when running the first time, and then not use the cache if any module has changed at all. But this defeats my purpose, because in my use case it is very likely that some unrelated function in an imported module has changed.
Previous work and tools I have looked at
joblib memoizes results keyed to the function name, and also saves the source code so it can check that it is unchanged. However, as far as I understand, it does not check upstream functions (those called by mymodule.func()).
The ast module gives me the Abstract Syntax Tree of any Python code, so I guess I can (in principle) figure it all out that way. How hard would this be? I am not very familiar with the AST.
Can I use any of all the black magic that's going on inside dill?
More trivia than a solution: IncPy, a finished/deceased research project, implemented a Python interpreter doing this by default, always. Nice idea, but never made it outside the lab.
Grateful for any input!
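For what it's worth, a crude sketch of the dependency-hashing direction, using code objects rather than source text (so inspect.getsource is not needed). All names are illustrative, and this is nowhere near a complete solution: it misses methods, dynamically resolved calls, closures, and C extensions, and hashing constants via repr is not stable across sessions for nested code objects.

```python
import hashlib
import types

def code_digest(func, _seen=None):
    """Digest a function's compiled code plus that of the module-level
    functions it references, recursively. A deliberately crude sketch
    of dependency tracking for cache invalidation."""
    if _seen is None:
        _seen = set()
    if func in _seen:
        return ''
    _seen.add(func)
    code = func.__code__
    h = hashlib.sha256(code.co_code)
    h.update(repr(code.co_consts).encode('utf-8'))
    # Follow names the function uses that resolve to plain Python
    # functions in its global namespace (i.e. direct upstream calls).
    for name in code.co_names:
        dep = func.__globals__.get(name)
        if isinstance(dep, types.FunctionType):
            h.update(code_digest(dep, _seen).encode('utf-8'))
    return h.hexdigest()

def helper(x):
    return x * 2

def func():
    return helper(21)

print(code_digest(func))
```

The idea is to key the on-disk cache on this digest: if the digest matches, load the saved result; otherwise rerun and overwrite. Because the digest covers only the functions actually reachable from func, unrelated changes elsewhere in an imported module do not invalidate the cache.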

Python: counting module imports?

I am a mid-level Python developer at an animation studio, and have been presented with a unique diagnostics request:
to assess what code gets used and what doesn't.
Within the sprawling, disorganized structure of Python modules importing modules,
I need to count the Python modules that are imported, and possibly, at a deeper level, find which methods are called.
As far as finding out which methods are called, I think that would be easy to solve by writing my own logging metaclass.
However, I'm at loss to imagine how I should count or log module imports at varying depths.
Thanks for any ideas you may have.
If you have a way to exercise the code, you can run the code under coverage.py. It's normally used for testing, but its basic function would work here: it indicates what lines of code are run and what are not.

Benchmarking run times in Python

I have to benchmark JSON serialization time and compare it to Thrift's and Google's Protocol Buffers' serialization times. It also has to be in Python.
I was planning on using the Python profilers.
http://docs.python.org/2/library/profile.html
Would the profiler be the best way to find function runtimes? Or would outputting a timestamp before and after the function call be the better option?
Or is there an even better way?
From the profile docs that you linked to:
Note: The profiler modules are designed to provide an execution profile for a given program, not for benchmarking purposes (for that, there is timeit for reasonably accurate results). This particularly applies to benchmarking Python code against C code: the profilers introduce overhead for Python code, but not for C-level functions, and so the C code would seem faster than any Python one.
So, no, you do not want to use profile to benchmark your code. What you want to use profile for is to figure out why your code is too slow, after you already know that it is.
And you do not want to output a timestamp before and after the function call, either. There are just way too many things you can get wrong that way if you're not careful (using the wrong timestamp function, letting the GC run a cycle collection in the middle of your test run, including test overhead in the loop timing, etc.), and timeit takes care of all of that for you.
Something like this is a common way to benchmark things:
import timeit

for impl in 'mycode', 'googlecode', 'thriftcode':
    t = timeit.timeit('serialize(data)',
                      setup='''from {} import serialize
with open('data.txt') as f:
    data = f.read()
'''.format(impl),
                      number=10000)
    print('{}: {}'.format(impl, t))
(I'm assuming here that you can write three modules that wrap the three different serialization tools in the same API, a single serialize function that takes a string and does something or other with it. Obviously there are different ways to organize things.)
You should be careful when profiling Python code based on timestamps taken at the start and end of the run: this does not account for other processes running concurrently, which can skew wall-clock measurements.
Instead, you should consider looking at
Is there any simple way to benchmark python script?

What's the best way to record the type of every variable assignment in a Python program?

Python is so dynamic that it's not always clear what's going on in a large program, and looking at a tiny bit of source code does not always help. To make matters worse, editors tend to have poor support for navigating to the definitions of tokens or import statements in a Python file.
One way to compensate might be to write a special profiler that, instead of timing the program, would record the runtime types and paths of objects of the program and expose this data to the editor.
This might be implemented with sys.settrace(), which sets a callback for each line of code and is how pdb is implemented, or by using the ast module and an import hook to instrument the code; or is there a better strategy? How would you write something like this without making it impossibly slow, and without running afoul of extreme dynamism, e.g. side effects on property access?
I don't think you can avoid making it slow, but it should be possible to detect the address of each variable when you encounter a STORE_FAST, STORE_NAME, or other STORE_* opcode.
Whether or not this has been done before, I do not know.
If you need debugging, look at pdb, which will allow you to step through your code and inspect any variables.
import pdb

def test():
    print 1
    pdb.set_trace()  # you will enter an interactive debugger here
    print 2
What if you monkey-patched object's class or another prototypical object?
This might not be the easiest if you're not using new-style classes.
You might want to check out PyChecker's code; it does (I think) what you are looking to do.
Pythoscope does something very similar to what you describe and it uses a combination of static information in a form of AST and dynamic information through sys.settrace.
BTW, if you have problems refactoring your project, give Pythoscope a try.
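A minimal sketch of the sys.settrace mechanism that Pythoscope builds on: after each traced line, record the type of every local variable in the frame. This is illustrative only; a real tool would record qualified paths, distinguish frames, and manage the considerable overhead.

```python
import sys
from collections import defaultdict

seen_types = defaultdict(set)

def tracer(frame, event, arg):
    # 'line' events fire before each line executes, so the locals we see
    # reflect all assignments made by previously executed lines.
    if event == 'line':
        for name, value in frame.f_locals.items():
            seen_types[name].add(type(value).__name__)
    return tracer  # keep tracing this frame and any frames it calls

def sample():
    x = 1
    x = 'now a string'   # same name, different runtime type
    y = [1, 2, 3]
    return y

sys.settrace(tracer)
sample()
sys.settrace(None)  # always switch tracing off when done

for name in sorted(seen_types):
    print(name, sorted(seen_types[name]))
```

Running this records that x was both an int and a str, and y a list, which is exactly the kind of runtime type information an editor plugin could surface.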
