Reset global variables in timeit.repeat - python

Scenario
Let test be the module we run as __main__. This module contains one global variable named primes, which is initialized in the module with the following assignment.
primes = []
The module also contains a function named pi, which alters this global variable:
def pi(n):
    global primes
    """Some code that modifies the global 'primes' variable"""
I then want to time said function using the builtin timeit module. I want to use the timeit.repeat function and get the minimum value of the timing, as a way of improving the measurement's accuracy (instead of measuring just one time, which may be subject to slow-down due to unrelated processes).
print(min(timeit.repeat('test.pi(50000)',
                        setup="import test",
                        number=1, repeat=10)) * 1000)
The problem is that the pi function behaves differently depending on the value of primes: I expected that, for each repetition, the import test statement in the setup parameter would re-run the primes = [] statement in the test, thus 'resetting' primes so that the code being executed would be identical for each repetition. But, instead, the value of primes that resulted from the previous execution is used, so I had to add the statement test.primes = [] to the setup parameter:
print(min(timeit.repeat('test.pi(50000)',
                        setup="import test \n" + "test.primes = []",
                        number=1, repeat=10)) * 1000)
Question
This leads me to the question: is there a direct way (i.e. in one statement) to 'reset' the values of all the global variables to what they were when they were first assigned in the module?
In this specific scenario adding that one statement to manually 'reset' primes works fine, but consider a case in which there are a lot of global variables, and you want to 'reset' all of them.
Side quest-ion
Why doesn't the statement import test re-run the initial primes = [] assignment?

Let's start with your side question, because it turns out that it's actually central to everything:
Why doesn't the statement import test re-run the initial primes = [] assignment?
Because, as explained in the docs on the import system and the import statement, what import test does is, loosely, this pseudocode:
if 'test' not in sys.modules:
    find, load (compiling if needed), and exec the module
    sys.modules['test'] = result
test = sys.modules['test']
OK, but why does it do that?
If you have two modules that both import the same module, they expect to see the same globals. And remember that types, functions, etc. defined at the top level of a module are all globals. For example, if sortedlist.py imports collections.abc to define class SortedList(collections.abc.Sequence):, and scraper.py imports collections.abc to check isinstance(something, collections.abc.Sequence), you'd want a SortedList to pass that test. It won't if those are two completely independent types that came from two different module objects that happen to have the same name.
If you have 12 modules that all import pandas as pd, you'd be running all the Pandas initialization code 12 times. Except that some of your modules also probably import each other, so they'd each be run multiple times, and import Pandas each time. How long do you think it would take to run all the Pandas initialization 60 times?
So, reusing existing modules is almost always what you want.
And when you don't, that's usually a sign that there's something wrong with your design (which may well be the case here).
But "almost always" isn't "always". So there are ways around it. None of them are usually a good idea for live code, but for things like unit tests and benchmarking, there are three basic options that are all fine, as long as the tradeoffs are the ones you want:
del sys.modules['test']. This is obviously pretty hacky, but it actually does exactly what you want here. Any existing references to the old module are completely untouched, but the next time anyone does import test, they're going to get a brand-new test module.
importlib.reload(test). This sounds great, but it may on the one hand be overkill (notice that it forces the module source to be recompiled, which you don't need), while on the other it may not be sufficient (it doesn't actually reset the globals—if your code does primes = [] at the top level, that line gets executed, so who cares, but if your code instead does, say, globals().setdefault('primes', []) inside the pi function, you care).
Instead of import test, manually do all the steps up through executing the module (see the examples in the importlib docs), but don't store it in sys.modules['test'] or in test, just store it in a local variable you discard after each test. This is probably the cleanest, although it does mean 6 lines of code instead of 1.
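For illustration, the third option might look roughly like the sketch below. It is only a sketch under stated assumptions: the module name demo_primes, the body of pi, and the helper load_fresh are all made up for the example (the real test.py is not shown in the question), and the file is written to a temporary directory just so the snippet is self-contained.

```python
import importlib.util
import tempfile
import timeit
from pathlib import Path

# Stand-in for the asker's test.py. The file name "demo_primes" and this
# particular pi() body are assumptions made for illustration only.
SOURCE = '''
primes = []

def pi(n):
    """Count the primes up to n, caching them in the module-level list."""
    global primes
    for candidate in range(2, n + 1):
        if all(candidate % p for p in primes):
            primes.append(candidate)
    return len(primes)
'''


def load_fresh(path, name="demo_primes"):
    """Execute the file at `path` and return a brand-new module object.

    The module is deliberately not stored in sys.modules, so every call
    re-runs the top-level code (including `primes = []`).
    """
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "demo_primes.py")
    Path(path).write_text(SOURCE)

    first = load_fresh(path)
    first_count = first.pi(100)          # fills first.primes
    second = load_fresh(path)
    fresh_primes = list(second.primes)   # independent of first.primes

    # setup runs once per repetition, so each timing starts from a fresh module.
    timings = timeit.repeat(
        "mod.pi(2000)",
        setup="mod = load_fresh(path)",
        globals={"load_fresh": load_fresh, "path": path},
        number=1,
        repeat=3,
    )

print(first_count, fresh_primes, min(timings) * 1000)
```

Because the module object only lives in the local name mod, nothing else in the process sees it, and the next repetition starts with an empty primes list again.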

Related

Difference Between np.random.uniform() and uniform() using built-in python packages

I'm using np.random.uniform() to generate a number in a class. Surprisingly, when I run the code, I can't see any expected changes in my results. On the other hand, when I use uniform() from python built-in packages, I see the changes in my results and that's obviously normal.
Are they really the same or is there anything tricky in their implementation?
Thank you in advance!
Create one module, say, blankpaper.py, with only two lines of code
import numpy as np
np.random.seed(420)
Then, in your main script, execute
import numpy as np
import blankpaper
print(np.random.uniform())
You should be getting exactly the same numbers.
When a module or library sets np.random.seed(some_number), it is global. Behind the numpy.random.* functions is an instance of the global RandomState generator, emphasis on the global.
It is very likely that something that you are importing is doing the aforementioned.
Change the main script to
import numpy as np
import blankpaper
rng = np.random.default_rng()
print(rng.uniform())
and you should be getting new numbers each time.
default_rng is a constructor for the random number class, Generator. As stated in the documentation,
This function does not manage a default global instance.
In reply to the question, "[a]re you setting a seed first?", you said
Yes, I'm using it but it doesn't matter if I don't use a seed or
change the seed number. I checked it several times.
Imagine we redefine blankpaper.py to contain the lines
import numpy as np
def foo():
    np.random.seed(420)
    print("I exist to always give you the same number.")
and suppose your main script is
import numpy as np
import blankpaper
np.random.seed(840)
blankpaper.foo()
print(np.random.uniform())
then you should be getting the same numbers as were obtained from executing the first main script (top of the answer).
In this case, the setting of the seed is hidden in one of the functions in the blankpaper module, but the same thing would happen if blankpaper.foo were a class and blankpaper.foo's __init__() method set the seed.
So this setting of the global seed can be quite "hidden".
Note that the above also applies to the functions in the random module. From its documentation:
The functions supplied by this module are actually bound methods of a
hidden instance of the random.Random class. You can instantiate your
own instances of Random to get generators that don’t share state.
So when uniform() from the random module was generating different numbers each time for you, it was very likely because neither you nor any other module set the seed shared by the functions of the random module.
In both numpy and random, if your class or application wants to have its own state, create an instance of Generator from numpy or Random from random (or SystemRandom for cryptographically-secure randomness). This will be something you can pass around within your application. Its methods will be the functions in the numpy.random or random module, only they will have their own state (unless you explicitly set them to be equal).
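To make the point about private state concrete, here is a small sketch using only the standard random module (the same idea should carry over to numpy's Generator). The function hidden_seed is a hypothetical stand-in for whatever imported code is reseeding the shared generator behind your back.

```python
import random


def hidden_seed():
    """Stand-in for library code that quietly reseeds the shared generator."""
    random.seed(420)


def draw_from_global():
    hidden_seed()
    return random.uniform(0, 1)   # shared state was just reset, so this repeats


# A private generator keeps its own state, untouched by random.seed().
private_rng = random.Random(1)
before = private_rng.uniform(0, 1)
hidden_seed()
after = private_rng.uniform(0, 1)

# Reference: the same seed, drawn twice with no interference in between.
reference_rng = random.Random(1)
reference = [reference_rng.uniform(0, 1), reference_rng.uniform(0, 1)]

global_repeats = draw_from_global() == draw_from_global()
private_undisturbed = [before, after] == reference
print(global_repeats, private_undisturbed)
```

The module-level function keeps returning the same value because its state is reset before every draw, while the private instance continues its own sequence as if hidden_seed had never been called.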
Finally, I am not claiming that this is exactly what is causing your problem (I had to make a few inferences since I cannot see your code), but this is a very likely reason.
Any questions/concerns please let me know!

GridSearchCV: print some expression each time a function completes a loop

Assume you have some function function in Python that works by looping: for example, it could be a function that evaluates a certain mathematical expression, e.g. x**2, for all elements of an array, e.g. [1, 2, ..., 100] (obviously this is a toy example). Would it be possible to write code such that, each time function goes through a loop iteration and obtains a result, some code is executed, e.g. print("Loop %s has been executed" % i)? So, in our example, when 1**2 has been computed, the program prints Loop 1 has been executed, then when 2**2 has been computed, it prints Loop 2 has been executed, and so on.
Note that the difficulty comes from the fact that I do not program the function, it is a preexisting function from some package (more specifically, the function I am interested in would be GridSearchCV from package scikit learn).
The easiest way to do this would be to just copy the function's code into your own function, tweak it, and then use it. In your case, you would have to subclass GridSearchCV and override the _fit method. The problem with this approach is that it may not survive a package upgrade.
In your case, that's not necessary. You can just specify a verbosity level when creating the object:
GridSearchCV(estimator, param_grid, verbose=100)
I'm not entirely sure what the verbosity number itself means. Here's the documentation from the package used internally that does the printing:
The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it more than 10, all iterations are reported.
You can look at the source code if you really want to know what the verbosity number does. I can't tell.
You could potentially use monkey-patching ("monkey" because it's hacky).
Assuming the library function is
def function(f):
    for i in range(100):
        i**2
and you want to enter a print statement, you would need to copy the entire function into your own file, and make your tiny edit:
def my_function(f):
    for i in range(100):
        i**2
        print("Loop %s" % i)
Now you overwrite the library function:
from library import module
module.existing_function = my_function
Obviously this is not an easily maintainable solution (if your target library is upgraded, you might have to go through this process again), so make sure you use it only for temporary debugging purposes.
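If you do go the monkey-patching route for temporary debugging, one way to keep it contained is to let unittest.mock.patch.object undo the patch for you. The sketch below is only an illustration: library, compute_squares and verbose_compute_squares are hypothetical names standing in for the real package and its function, not anything from scikit-learn.

```python
import types
from unittest import mock

# Hypothetical stand-in for a third-party module and its looping function.
library = types.SimpleNamespace()


def _original_compute_squares(values):
    results = []
    for v in values:
        results.append(v ** 2)
    return results


library.compute_squares = _original_compute_squares


def verbose_compute_squares(values):
    """Copy of the library function with one extra print per iteration."""
    results = []
    for i, v in enumerate(values, start=1):
        results.append(v ** 2)
        print("Loop %s has been executed" % i)
    return results


# The patch is active only inside the with-block and is reverted on exit.
with mock.patch.object(library, "compute_squares", verbose_compute_squares):
    patched_result = library.compute_squares([1, 2, 3])

restored = library.compute_squares is _original_compute_squares
print(patched_result, restored)
```

This does not remove the maintenance problem (the copied body can still drift from the library's version), but it does make sure the hack cannot leak outside the block where you need it.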

Unexpected performance loss when calling Cython function within Python script?

So I have a time-critical section of code within a Python script, and I decided to write a Cython module (with one function -- all I need) to replace it. Unfortunately, the execution speed of the function I'm calling from the Cython module (which I'm calling within my Python script) isn't nearly as fast as I tested it to be in a variety of other scenarios. Note that I CANNOT share the code itself because of contract law! See the following cases, and take them as an initial description of my issue:
(1) Execute Cython function by using the Python interpreter to import the module and run the function. Runs relatively quickly (~0.04 sec on ~100 separate tests, versus original ~0.24 secs).
(2) Call Cython function within Python script at 'global' level (i.e. not inside any function). Same speed as case (1).
(3) Call Cython function within Python script, with Cython function inside my Python script's main function; tested with the Cython function in global and local namespaces, all with the same speed as case (1).
(4) Same as (3), but inside a simple for-loop within said Python function. Same speed as case (1).
(5) problem! Same as (4), but inside yet another for-loop: Cython function's execution time (whether called globally or locally) balloons to ~10 times that of the other cases, and this is where I need the function to get called. Nothing odd to report about this loop, and I tested all of the components of this loop (adjusting/removing what I could). I also tried using a 'while' loop for giggles, to no avail.
"One thing I've yet to try is making this inner-most loop a function and going from there." EDIT: Just tried this- no luck.
Thanks for any suggestions you have- I deeply regret not being able to share my code...it hurts my soul a little, but my client just can't have this code floating around. Let me know if there is any other information that I can provide!
-The Real Problem and an Initial (ugly) Solution-
It turns out that the best hint in this scenario was the obvious one (as usual): it wasn't the for-loop that was causing the problem; why would it? After a few more tests, it became obvious that something about the way I was calling my Cython function was wrong, because I could call it elsewhere (using an input variable different from the one going to the 'real' Cython function) without the performance loss issue.
The underlying issue: data types. I wrote my Cython function to expect a list full of standard floats. Unfortunately, my code did this:
function_input = list(numpy_array_containing_npfloat64_data)  # yuck.
# type(function_input[0]) is numpy.float64
output = Cython_Function(function_input)
inside the Cython function:
def Cython_Function(list function_input):
    cdef many_vars
    """process lots of vars expecting C floats"""  # Slowness from converting numpy.float64's --> floats???
    # type(output) is list
    return output
I'm aware that I can play around more with types in the Cython function, which I very well may do to prevent having to 'list' an existing numpy array. Anyway, here is my current solution:
function_input = [float(x) for x in function_input]
I welcome any feedback and suggestions for improvement. The function_input numpy array doesn't really need the precision of numpy.float64, but it does get used a few times before getting passed to my Cython function.
It could be that, while individually, each function call with the Cython implementation is faster than its corresponding Python function, there is more overhead in the Cython function call because it has to look up the name in the module namespace. You can try assigning the function to a local callable first, for example:
from module import function
def main():
    my_func = function
    for i in sequence:
        my_func()
If possible, you should try to include the loops within the Cython function, which would reduce the overhead of a Python loop to the (very minimal) overhead of a compiled C loop. I understand that it might not be possible (i.e. need references from a global/larger scope), but it's worth some investigation on your part. Good luck!
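As a rough way to check whether name lookup matters in your case, you can time both variants yourself. This is a sketch that uses math.sqrt as a stand-in for the Cython function (which I obviously cannot see); the function names are made up, and the measured difference will depend on your interpreter and machine.

```python
import math
import timeit


def call_via_module(n):
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)      # global lookup of math, then attribute lookup
    return total


def call_via_local(n):
    sqrt = math.sqrt               # look the function up once, outside the loop
    total = 0.0
    for i in range(n):
        total += sqrt(i)
    return total


same_result = call_via_module(1000) == call_via_local(1000)
t_module = min(timeit.repeat(lambda: call_via_module(10_000), number=10, repeat=3))
t_local = min(timeit.repeat(lambda: call_via_local(10_000), number=10, repeat=3))
print(same_result, t_module, t_local)
```

If the two timings are close, the lookup is not your bottleneck and the time is going somewhere else, for example into converting the arguments.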
function_input = list(numpy_array_containing_npfloat64_data)
def Cython_Function(list function_input):
    cdef many_vars
I think the problem is in using the numpy array as a list ... can't you use the np.ndarray as input to the Cython function?
def Cython_Function(np.ndarray[np.float64_t, ndim=1] input):  # needs "cimport numpy as np"; ndim=1 is an assumption
    ....

Storing a list of modules

I recently learned of the __import__ function and found that I could store a module in a variable, so I was thinking of making a list of modules and then calling the appropriate one when necessary.
So I might have three modules test1, test2, test3, each containing a single function "print_hello" that simply prints "hello, I'm [module name]".
At runtime, I would call some function to import those modules and put them in a list.
Then I would pick a random number between 0 and 2 inclusive, pick that module from the list, and print hello.
# run function to import each module, resulting in the following list
# my_modules = [test1, test2, test3]
# generate some number i
chosen_module = my_modules[i]
chosen_module.print_hello()
Is this acceptable coding practice? Or are there any reasons that would discourage this?
I use this sort of approach in some of my testing code. I want to test output from one version of a module against different versions of the same module. Being able to iterate over different module instances makes the code cleaner.
But this sort of code is the exception to the rule. Only rarely is this approach the cleanest solution to a problem.
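For reference, building such a list is usually done with importlib.import_module rather than calling __import__ directly. The sketch below uses two standard-library modules as stand-ins for test1, test2 and test3 (which I don't have), and calls sqrt only because both stand-ins happen to provide it.

```python
import importlib
import random

# Stand-ins for the asker's test1/test2/test3; any importable names would do.
MODULE_NAMES = ["math", "cmath"]

# Import each module by name and keep the module objects in a list.
my_modules = [importlib.import_module(name) for name in MODULE_NAMES]

# Pick one at random and call a function that every module in the list provides.
chosen_module = random.choice(my_modules)
print(chosen_module.__name__, chosen_module.sqrt(4))
```

The pattern only works cleanly when every module in the list exposes the same function names, which is effectively an informal interface you have to maintain yourself.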

When are function local python variables created?

When are function-local variables created? For example, in the following code, is dictionary d1 created each time the function f1 is called, or only once when it is compiled?
def f1():
    d1 = {1: 2, 3: 4}
    return id(d1)
d2 = {1: 2, 3: 4}
def f2():
    return id(d2)
Is it faster in general to define a dictionary within function scope or to define it globally (assuming the dictionary is used only in that function)? I know it is slower to look up global symbols than local ones, but what if the dictionary is large?
Much python code I've seen seems to define these dictionaries globally, which would seem not to be optimal. But also in the case where you have a class with multiple 'encoding' methods, each with a unique (large-ish) lookup dictionary, it's awkward to have the code and data spread throughout the file.
Local variables are created when assigned to, i.e., during the execution of the function.
If every execution of the function needs (and does not modify!-) the same dict, creating it once, before the function is ever called, is faster. As an alternative to a global variable, a fake argument with a default value is even (marginally) faster, since it's accessed as fast as a local variable but also created only once (at def time):
def f(x, y, _d={1:2, 3:4}):
I'm using the name _d, with a leading underscore, to point out that it's meant as a private implementation detail of the function. Nevertheless it's a bit fragile, as a bumbling caller might accidentally and erroneously pass three arguments (the third one would be bound as _d within the function, likely causing bugs), or the function's body might mistakenly alter _d, so this is only recommended as an optimization to use when profiling reveals it's really needed. A global dict is also subject to erroneous alterations, so, even though it's faster than building a local dict afresh on every call, you might still pick the latter possibility to achieve higher robustness (although the global dict solution, plus good unit tests to catch any "oops"es in the function, are the recommended alternative;-).
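A quick way to see both the benefit and the fragility of the default-argument trick is to compare object identities. This is just a sketch; the function names build_each_call and build_once are made up for the illustration.

```python
def build_each_call():
    d = {1: 2, 3: 4}          # a new dict is created on every call
    return d


def build_once(_d={1: 2, 3: 4}):
    return _d                 # the dict was created once, when def executed


first_local, second_local = build_each_call(), build_each_call()
first_default, second_default = build_once(), build_once()

distinct_locals = first_local is not second_local     # separate objects
shared_default = first_default is second_default      # one shared object

# The fragility: any mutation of the shared dict is visible to later calls.
first_default[5] = 6
leaked = build_once()
print(distinct_locals, shared_default, leaked)
```

The shared object is what makes the default-argument version cheap, and it is also why an accidental mutation sticks around for every later call.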
If you look at the disassembly with the dis module you'll see that the creation and filling of d1 is done on every execution. Given that dictionaries are mutable this is unlikely to change anytime soon, at least until good escape analysis comes to Python virtual machines. On the other hand lookup of global constants will get speculatively optimized with the next generation of Python VM's such as unladen-swallow (the speculation part is that they are constant).
Speed is relative to what you're doing. If you're writing a database-intensive application, I doubt your application is going to suffer one way or another from your choice of global versus local variables. Use a profiler to be sure. ;-)
As Alex noted, the locals are initialized when the function is called. An easy way to demonstrate this for yourself:
import random
def f():
    d = [random.randint(1, 100), random.randint(100, 1000)]
    print(d)
f()
f()
f()
