Faster implementation of combinations_with_replacement from itertools? - python

In Python, there is a module called itertools which offers many functions for iteration, among them some functions for combinatorics. One such function of particular interest to me is combinations_with_replacement. As the name suggests, it provides you with combinations taken from a multiset, with replacement.
For instance, for the multiset (a, b, c), choosing 3 would yield:
aaa, aab, aac, abb, abc, acc, bbb, bbc, bcc, ccc.
Many functions from the itertools module have found faster implementations in NumPy (for instance the more vanilla combinations, here), and I was wondering if such an implementation could be devised for this function as well.
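For what it's worth, one fully vectorized sketch (a hypothetical approach, not a benchmarked drop-in) builds every r-tuple of indices and keeps only the non-decreasing rows, which are exactly the combinations with replacement:
import numpy as np

def cwr_indices(n, r):
    # All n**r index tuples in lexicographic order, one per row...
    grid = np.indices((n,) * r).reshape(r, -1).T
    # ...filtered down to the non-decreasing rows.
    mask = np.all(grid[:, :-1] <= grid[:, 1:], axis=1)
    return grid[mask]

# cwr_indices(3, 3) yields the 10 index triples for (a, b, c) choose 3.
# Memory is O(r * n**r) before filtering, so whether this beats the
# C-backed itertools version depends heavily on n and r.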

Related

Optimizing Itertools Results in Python

I am calling itertools in Python (see below). In this code, snp_dic is a dictionary with integer keys and sets as values. The goal is to find the minimum list of keys whose values union to set_union. (This is equivalent to solving the popular NP-hard set-cover problem to global optimality, for those of you who are interested!) The algorithm below works, but the goal here is optimization.
The most obvious optimization I see has to do with itertools. Let's say that for a length r, there exists a combination of r sets in snp_dic whose union equals set_union. Basic probability dictates that if this combination exists and is distributed uniformly at random over the combinations, then on average you only have to iterate over half the combinations to find it, checking the set union at each iteration. Itertools, however, will return all the possible combinations, taking twice as long as that expected time.
A logical solution would seem to be simply to implement itertools.combinations() locally. Based on the "equivalent" Python implementation of itertools.combinations() in the Python docs, however, that is approximately twice as slow, because the built-in itertools.combinations is a C-level implementation rather than a Python-native one.
The question (finally) is then: how can I stream the results of itertools.combinations() one by one, so I can check set unions as I go, while still running in nearly the same time as the C-backed itertools.combinations()? In an answer I would appreciate it if you could include timings of your new method to show it runs in comparable time. Any other optimizations are also appreciated.
import itertools
from functools import reduce

def min_informative_helper(snp_dic, min_size, set_union):
    union = lambda set_iterable: reduce(lambda a, b: a | b, set_iterable)  # takes the union of sets
    for i in range(min_size, len(snp_dic)):
        combinations = itertools.combinations(snp_dic, i)
        combinations = [{k: snp_dic[k] for k in combination} for combination in combinations]
        for combination in combinations:
            comb_union = union(combination.values())
            if comb_union == set_union:
                return combination.keys()
itertools returns lazy iterators for the things it produces. To stream them, simply use
for combo in itertools.combinations(snp_dic, i):
    ...  # remainder of your logic
The iterator yields one new combination each time you advance it: one per loop iteration, with nothing materialized up front.
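Applied to the function above, a minimal sketch that streams combinations and exits as soon as a covering one is found (no intermediate dicts, so the probabilistic early-exit argument from the question applies):
import itertools
from functools import reduce

def min_informative_helper(snp_dic, min_size, set_union):
    for i in range(min_size, len(snp_dic)):
        for combination in itertools.combinations(snp_dic, i):
            # combination is already the tuple of keys; union their sets lazily
            comb_union = reduce(lambda a, b: a | b, (snp_dic[k] for k in combination))
            if comb_union == set_union:
                return combination
Timing it against the original on your own data is the only reliable comparison, but this way the combination generation itself stays in C.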

random.choice gives different results on Python 2 and 3

Background
I want to test my code which depends on random module.
The problematic PR is https://github.com/Axelrod-Python/Axelrod/pull/202 and code is here https://github.com/Axelrod-Python/Axelrod/blob/master/axelrod/strategies/qlearner.py
The problem
Since random module produces pseudo-random numbers, I always set random.seed(X) to known value X. This works for consecutive test runs. However, Python 3 seems to give different numbers than Python 2 when using random.choice([D, C])
The following snippet:
from __future__ import print_function  # lets the same file run on Python 2 and 3
import random
random.seed(1)
for i in range(10):
    print(random.choice(['C', 'D']), end=', ')
gives different results on Python 2 and 3:
$ python2 test.py
C, D, D, C, C, C, D, D, C, C
$ python3 test.py
C, C, D, C, D, D, D, D, C, C
However, the random.random method works the same on 2.x and 3.x:
import random
random.seed(1)
for i in range(10):
    print(random.random())
$ python3 test.py
0.13436424411240122
0.8474337369372327
0.763774618976614
0.2550690257394217
0.49543508709194095
0.4494910647887381
0.651592972722763
0.7887233511355132
0.0938595867742349
0.02834747652200631
$ python2 test.py
0.134364244112
0.847433736937
0.763774618977
0.255069025739
0.495435087092
0.449491064789
0.651592972723
0.788723351136
0.0938595867742
0.028347476522
Workaround
I can mock the output of random.choice, which works well for simple test cases. However, for fairly complicated test cases I'm not able to mock the output, because I simply don't know what it should look like.
The question
Have I done something wrong when calling random.choice method?
There is a completely different implementation of random.choice in each version.
Python 2.7:
def choice(self, seq):
    """Choose a random element from a non-empty sequence."""
    return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
https://hg.python.org/cpython/file/2.7/Lib/random.py
Python 3.4:
def choice(self, seq):
    """Choose a random element from a non-empty sequence."""
    try:
        i = self._randbelow(len(seq))
    except ValueError:
        raise IndexError('Cannot choose from an empty sequence')
    return seq[i]
https://hg.python.org/cpython/file/3.4/Lib/random.py
The _randbelow method may call random() more than once, or may call getrandbits(), which consumes the seeded generator's output differently, so the two versions draw different values from the same underlying stream.
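You can see this concretely: emulating the 2.7 formula on Python 3 reproduces the Python 2 picks, because it consumes exactly one random() value per choice (a quick sketch built from the snippet above):
import random

random.seed(1)
seq = ['C', 'D']
for i in range(10):
    # Python 2.7's choice(): a single random() call per pick
    print(seq[int(random.random() * len(seq))], end=', ')
# On Python 3 this prints C, D, D, C, C, C, D, D, C, C -- the Python 2 output above.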
According to https://docs.python.org/2/library/random.html, the core generator (the Mersenne Twister) guarantees that random() itself produces the same sequence for the same seed, which is exactly why the random.random() runs above agree across versions. The distribution methods layered on top of it, such as choice, shuffle, and sample, carry no such guarantee and may change between Python versions, as they did here.
Short version: you should never depend on those methods to give a deterministic result across versions. If you need a known sequence to satisfy a unit test, you need to either redesign your method or your unit test.
One way you might do this is to split your method into two parts: one part generates the random number. The second part consumes the value and acts on it. You would then write two unit tests: one to test coverage of the generated values, and a separate one to test the output of your method based on specific inputs.
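A minimal sketch of that split (the names here are hypothetical, not from the question's code):
import random

def decide(roll):
    # Pure function of the random value -- trivially testable.
    return 'C' if roll < 0.5 else 'D'

def play(rng=random):
    # The only randomness lives here, behind an injectable generator.
    return decide(rng.random())

# Unit-test the deterministic part with fixed inputs:
assert decide(0.1) == 'C'
assert decide(0.9) == 'D'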
Another way might be to change your method to output not just the result but the random number that created that result. You can modify your unit test to compare the two and pass or fail the test based on the expected output of known pairs.
Or perhaps your unit test can be modified to simply run the test n times and look for a spread that confirms some sort of randomness.
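For instance, a statistical test along those lines might look like this sketch (bounds are arbitrary and would need tuning to your tolerance):
import random

def test_choice_spread():
    picks = [random.choice(['C', 'D']) for _ in range(10000)]
    # No fixed seed: we assert a property of the distribution,
    # not a particular sequence.
    ratio = picks.count('C') / float(len(picks))
    assert 0.45 < ratio < 0.55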
I've got the exact same problem, and I'm disappointed at the number of responses that point to error on the asker's part: seeding the random module is expected to produce reliably consistent results across versions of Python, machines, and operating systems.
What seems to work is, painfully, to have your own random class and override the relevant methods with the logic from Python 2.7.
from random import Random

class MyRandom(Random):
    def choice(self, seq):
        # logic from the Python 2.7 random module: one random() call per pick
        return seq[int(self.random() * len(seq))]
    def sample(self, population, k):
        ...  # (code from the Python 2.7.6 random module, updated for Python 3 syntax)

my_random = MyRandom(0)
my_random.choice(['Apples', 'Bananas', 'Carrots'])
The distribution functions themselves are different, so having the same seed produce the same underlying numbers doesn't help: the methods built on top consume those numbers differently and refuse to return the same results. While there are reasons for the newer implementations, those reasons are moot for existing code bases already dependent on the older behaviour.
Anyway, I hope this can help anyone else fighting this issue.

How to quickly compute a hash for a collection of objects?

Consider a function f(*x) which takes a lot of arguments *x. Based on these arguments (objects), the function f composes a rather complex object o and returns it. o implements __call__, so o itself serves as a function. Since the composition of o is pretty time consuming and in my scenario there is no point in having multiple instances of o based on the same arguments *x, they are to be cached.
The question is now: how to efficiently compute a hash based on multiple arguments *x? Currently I am using a Python dictionary, and I concatenate the str() representations of each x to build each key. It works in my scenario, but it feels rather awkward. I need to call the resulting objects o at a very high frequency, so I suspect the repeated str() calls and string concatenations waste a lot of computation time.
You can use the hash built-in function, combining the hashes of the items in x together. The typical way to do this (see e.g. the documentation) would be an xor across all the hashes of the individual objects:
it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects
To implement that in a functional way, using operator and reduce:
from functools import reduce  # only required in Python 3.x
from operator import xor

def hashed(*x):
    return reduce(xor, map(hash, x))
See also this question.
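In fact, since the arguments arrive packed in a tuple anyway with *x, the built-in tuple hash already does this mixing for you, and unlike a plain xor it is sensitive to order and to repeated elements; a one-line sketch:
def hashed(*x):
    # tuple.__hash__ combines the element hashes internally,
    # so this works whenever every argument is itself hashable
    return hash(x)
And for caching you may not need an explicit hash at all: the tuple x itself can serve directly as the dictionary key.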
Since version 3.2, Python already contains an implementation of an LRU cache that you can use
to cache your functions' results based on their arguments:
functools.lru_cache
Example:
from functools import lru_cache

@lru_cache(maxsize=32)
def f(*args):
    """Expensive function"""
    print("f(%s) has been called." % (args, ))
    return sum(args)

print(f(1, 2, 3))
print(f(1, 2, 3, 4))
print(f(1, 2, 3))
print(f.cache_info())
Output:
f((1, 2, 3)) has been called.
6
f((1, 2, 3, 4)) has been called.
10
6
CacheInfo(hits=1, misses=2, maxsize=32, currsize=2)
(Notice how f(1, 2, 3) only got called once)
As suggested in the comments, it's probably best to simply use the hash()es of your arguments to build the cache-key for your arguments - that's what lru_cache already does for you.
If you're still on Python 2.7, Raymond Hettinger has posted some recipes with LRU caches that you could use in your own code.
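A bare-bones sketch of such a memoizing decorator (my own illustration, not Hettinger's recipe) that works on 2.7:
from functools import wraps

def memoize(func):
    cache = {}
    @wraps(func)
    def wrapper(*args):
        # Keyed on the positional-args tuple; keyword arguments
        # would have to be folded into the key as well.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper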

Which is faster and why? Set or List?

Let's say that I have a graph and want to check whether b in N[a]. Which is the faster implementation, and why?
a, b = range(2)
N = [set([b]), set([a,b])]
OR
N= [[b],[a,b]]
This is obviously oversimplified, but imagine that the graph becomes really dense.
Membership testing in a set is vastly faster, especially for large sets. That is because the set uses a hash function to map to a bucket. Since Python implementations automatically resize that hash table, the speed can be constant (O(1)) no matter the size of the set (assuming the hash function is sufficiently good).
In contrast, to evaluate whether an object is a member of a list, Python has to compare every single member for equality, i.e. the test is O(n).
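A quick sketch to measure that gap (collection size and repeat counts chosen arbitrarily):
import timeit

setup = "s = set(range(100000)); l = list(range(100000))"
# Worst case for the list: the sought element is at the very end.
print(timeit.timeit("99999 in s", setup=setup, number=10000))
print(timeit.timeit("99999 in l", setup=setup, number=10000))
# The set lookup stays flat as the collection grows; the list scan grows linearly.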
It all depends on what you're trying to accomplish. Using your example verbatim, it's faster to use lists, as you don't have to go through the overhead of creating the sets:
import timeit

def use_sets(a, b):
    return [set([b]), set([a, b])]

def use_lists(a, b):
    return [[b], [a, b]]

t = timeit.Timer("use_sets(a, b)", """from __main__ import use_sets
a, b = range(2)""")
print "use_sets()", t.timeit(number=1000000)

t = timeit.Timer("use_lists(a, b)", """from __main__ import use_lists
a, b = range(2)""")
print "use_lists()", t.timeit(number=1000000)
Produces:
use_sets() 1.57522511482
use_lists() 0.783344984055
However, for the reasons already mentioned here, you benefit from using sets when you are searching large collections. It's impossible to tell from your example where that inflection point lies for you and whether or not you'll see the benefit.
I suggest you test it both ways and go with whatever is faster for your specific use-case.
A set (by which I mean a hash-based set, like Java's HashSet) is much faster than a list for looking up a value. A list has to scan sequentially to find out whether the value exists; a hash set can jump directly to the right bucket and locate the value in almost constant time.

Does this function have to use reduce() or is there a more pythonic way?

If I have a value, and a list of additional terms I want multiplied to the value:
n = 10
terms = [1,2,3,4]
Is it possible to use a list comprehension to do something like this:
n *= (term for term in terms) #not working...
Or is the only way:
n *= reduce(lambda x,y: x*y, terms)
This is on Python 2.6.2. Thanks!
reduce is the best way to do this IMO, but you don't have to use a lambda; instead, you can use operator.mul, the functional form of the * operator:
import operator
n *= reduce(operator.mul, terms)
n is now 240. See the docs for the operator module for more info.
Reduce is not the only way. You can also write it as a simple loop:
for term in terms:
    n *= term
I think this is much clearer than using reduce, especially when you consider that many Python programmers have never seen reduce, and the name does little to tell first-time readers what it actually does.
Pythonic does not mean write everything as comprehensions or always use a functional style if possible. Python is a multi-paradigm language and writing simple imperative code when appropriate is Pythonic.
Guido van Rossum also doesn't want reduce in Python:
So now reduce(). This is actually the one I've always hated most, because, apart from a few examples involving + or *, almost every time I see a reduce() call with a non-trivial function argument, I need to grab pen and paper to diagram what's actually being fed into that function before I understand what the reduce() is supposed to do. So in my mind, the applicability of reduce() is pretty much limited to associative operators, and in all other cases it's better to write out the accumulation loop explicitly.
There aren't a whole lot of associative operators. (Those are operators X for which (a X b) X c equals a X (b X c).) I think it's just about limited to +, *, &, |, ^, and shortcut and/or. We already have sum(); I'd happily trade reduce() for product(), so that takes care of the two most common uses. [...]
In Python 3 reduce has been moved to the functools module.
Yet another way:
import operator
n = reduce(operator.mul, terms, n)
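(As a footnote to Guido's wish for a product() built-in: Python 3.8 later added exactly that as math.prod, so on modern Python no reduce is needed at all.)
import math

n = 10
n *= math.prod([1, 2, 3, 4])  # n is now 240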
