What is this set of all sets one line function doing? - python

I found this one-line function on the Python wiki that creates the power set, i.e. all subsets that can be created from a list passed as an argument.
f = lambda x: [[y for j, y in enumerate(set(x)) if (i >> j) & 1] for i in range(2**len(set(x)))]
Can someone please explain how this function works?

To construct the powerset, iterating over range(2**len(set(x))) gives you all the binary combinations of the set:
range(2**len(set(x)))  # in binary: 00000, 00001, 00010, ..., 11110, 11111
Now you just need to test if the bit is set in i to see if you need to include it in the set, e.g.:
>>> i = 0b10010
>>> [y for j, y in enumerate(range(5)) if (i >> j) & 1]
[1, 4]
Though I'm not sure how efficient it is given the call to set(x) for every iteration. There is a small hack that would avoid that:
f = lambda x: [[y for j, y in enumerate(s) if (i >> j) & 1] for s in [set(x)] for i in range(2**len(s))]
A couple of other forms using itertools:
import itertools as it
f1 = lambda x: [list(it.compress(s, i)) for s in [set(x)] for i in it.product((0,1), repeat=len(s))]
f2 = lambda x: list(it.chain.from_iterable(it.combinations(set(x), r) for r in range(len(set(x))+1)))
Note: this last one could return an iterator instead of a list if you remove the outer list(); depending on the use case, this could save some memory.
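As a quick sanity check (my own example, not from the original answer), all the variants should produce the same subsets, just in different orders, so compare them as sets of frozensets:
data = [3, 1, 2, 1]
assert {frozenset(s) for s in f(data)} == {frozenset(s) for s in f1(data)} == {frozenset(s) for s in f2(data)}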
Looking at some timings for a list of 25 random numbers in the range 0-50:
%%timeit binary: 1 loop, best of 3: 20.1 s per loop
%%timeit binary+hack: 1 loop, best of 3: 17.9 s per loop
%%timeit compress/product: 1 loop, best of 3: 5.27 s per loop
%%timeit chain/combinations: 1 loop, best of 3: 659 ms per loop

Let's rewrite it a bit and break it down step by step:
f = lambda x: [[y for j, y in enumerate(set(x)) if (i >> j) & 1] for i in range(2**len(set(x)))]
is equivalent to:
def f(x):
    n = len(set(x))
    sets = []
    for i in range(2**n):  # i runs over all bit patterns, one per subset
        set_i = []
        for j, y in enumerate(set(x)):
            if (i >> j) & 1:  # check if bit nr j is set
                set_i.append(y)
        sets.append(set_i)
    return sets
for an input list like [1,2,3,4], the following happens:
n=4
range(2**n)=[0,1,2,3...15]
which, in binary is:
0,1,10,11,100...1110,1111
enumerate pairs each y of the set with its index j, so in our case:
[(0,1),(1,2),(2,3),(3,4)]
The (i>>j) & 1 part might require some explanation.
(i>>j) shifts the number i by j places to the right, e.g. in decimal: 4>>2 = 1, or in binary: 100>>2 = 001. The & is the bit-wise AND operator. It checks, for every bit position of both operands, whether both are 1 and returns the result as a number, acting like a filter: 10111 & 11001 = 10001.
In the case of our example, it checks if the bit at place j is 1. If it is, the corresponding value is added to the result list. This way the binary map of combinations is converted to a list of lists, which is returned.
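Putting it all together, here is what the one-liner returns for a small example list (note: set iteration order is an implementation detail; for small ints, CPython happens to iterate them in ascending order, which is what the output below assumes):
>>> f([1, 2, 3])
[[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
i = 0b000 selects nothing, i = 0b101 selects the elements at positions 0 and 2, and so on.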

Related

Fastest way to get pairs of range(n)?

Imagine you want to process all pairs of the numbers 0 to n-1, for example for n = 4 that's these six pairs:
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
Three ways to create those pairs:
list(combinations(range(n), 2))
[(i, j) for i, j in combinations(range(n), 2)]
[(i, j) for i in range(n) for j in range(i+1, n)]
Benchmark results for n = 1000:
44.1 ms ± 0.2 ms f_combinations_pure
57.7 ms ± 0.3 ms f_combinations
66.6 ms ± 0.1 ms f_ranges
Note I'm not really interested in just storing the pairs (i, j). That's just a minimal example usage of i and j so that we can compare different approaches without much overhead. In reality, you want to do something with i and j, for example [my_string[i:j] for ...] to get substrings (the question whose comments inspired this). So the list(combinations(...)) one doesn't really count here; I show it just to make that clear (although I still liked seeing how fast it is).
Question 1: Why is f_ranges slower than f_combinations? Its for i in runs only n times overall, so it's insignificant compared to the for j in, which runs n*(n-1)/2 times. And for j in range(...) only assigns one number, whereas for i, j in combinations(...) builds and assigns pairs of numbers, so the latter should be slower. Why is it faster?
Question 2: What's the fastest way you can come up with? For fair comparison, it shall be a list comprehension [(i, j) for ...] producing the same list of pairs.
(As I'm including an answer myself (which is encouraged), I'm including benchmark code there.)
About question 1: Why is range slower than combinations?
While for j in range(...) indeed has the advantage of assigning just one number, it has the disadvantage of creating them over and over again. In Python, numbers are objects, and their creation (and deletion) takes a little time.
combinations(...) on the other hand first creates and stores the number objects only once, and then reuses them over and over again in the pairs. You might think "Hold on, it can reuse the numbers, but it produces the pairs as tuple objects, so it also creates one object per iteration!". Well... it has an optimization. It actually reuses the same tuple object over and over again, filling it with different numbers. "What? No way! Tuples are immutable!" Well... ostensibly they're immutable, yes. But if the combinations iterator sees that there are no other references to its result tuple, then it "cheats" and modifies it anyway. At the C code level, it can do that. And if nothing else has a reference to it, then there's no harm. Note that for i, j in ... unpacks the tuple and doesn't keep a reference to it. If you instead use for pair in ..., then pair is a reference to it and the optimization isn't applied and indeed a new result tuple gets created every time. See the source code of combinations_next if you're interested.
About question 2: What's the fastest way?
I found four faster ways:
44.1 ms ± 0.2 ms f_combinations_pure
51.7 ms ± 0.1 ms f_list
52.7 ms ± 0.2 ms f_tee
53.6 ms ± 0.1 ms f_copy_iterator
54.6 ms ± 0.1 ms f_tuples
57.7 ms ± 0.3 ms f_combinations
66.6 ms ± 0.1 ms f_ranges
All four faster ways avoid what made the range solution slow: Instead of creating (and deleting) Θ(n²) int objects, they reuse the same ones over and over again.
f_tuples puts them into a tuple and iterates over slices:
def f_tuples(n):
    nums = tuple(range(n))
    return [(i, j)
            for i in nums
            for j in nums[i+1:]]
f_list puts them into a list and then before each j-loop, it removes the first number:
def f_list(n):
    js = list(range(n))
    return [(i, j)
            for i in range(n)
            if [js.pop(0)]
            for j in js]
f_copy_iterator puts them into a tuple, then uses an iterator for i and a copy of that iterator for j (which is an iterator starting one position after i):
def f_copy_iterator(n):
    nums = iter(tuple(range(n)))
    return [(i, j)
            for i in nums
            for j in copy(nums)]
f_tee uses itertools.tee for a similar effect as copy. Its JS is the main iterator of j values, and before each j-loop, it discards the first value and then tees JS to get a second iterator of the remaining values:
def f_tee(n):
    return [(i, j)
            for JS in [iter(range(n))]
            for i in range(n)
            for _, (JS, js) in [(next(JS), tee(JS))]
            for j in js]
Bonus question: Is it worth it to optimize like those faster ways?
Meh, probably not. Probably you'd best just use for i, j in combinations(...). The faster ways aren't much faster, and they're somewhat more complicated. Plus, in reality, you'll actually do something with i and j (like getting substrings), so the relatively small speed advantage becomes even relatively smaller.
But I hope you at least found this interesting and perhaps learned something new that is useful some day.
Full benchmark code
Try it online!
def f_combinations_pure(n):
    return list(combinations(range(n), 2))

def f_combinations(n):
    return [(i, j) for i, j in combinations(range(n), 2)]

def f_ranges(n):
    return [(i, j) for i in range(n) for j in range(i+1, n)]

def f_tuples(n):
    nums = tuple(range(n))
    return [(i, j) for i in nums for j in nums[i+1:]]

def f_list(n):
    js = list(range(n))
    return [(i, j) for i in range(n) if [js.pop(0)] for j in js]

def f_copy_iterator(n):
    nums = iter(tuple(range(n)))
    return [(i, j) for i in nums for j in copy(nums)]

def f_tee(n):
    return [(i, j)
            for JS in [iter(range(n))]
            for i in range(n)
            for _, (JS, js) in [(next(JS), tee(JS))]
            for j in js]

fs = [
    f_combinations_pure,
    f_combinations,
    f_ranges,
    f_tuples,
    f_list,
    f_copy_iterator,
    f_tee,
]

from timeit import default_timer as time
from itertools import combinations, tee
from statistics import mean, stdev
from random import shuffle
from copy import copy

# Correctness
expect = fs[0](1000)
for f in fs:
    result = f(1000)
    assert result == expect

# Prepare for timing
times = {f: [] for f in fs}
def stats(f):
    ts = [t * 1e3 for t in sorted(times[f])[:5]]
    return f'{mean(ts):4.1f} ms ± {stdev(ts):3.1f} ms '

# Timing
for i in range(25):
    shuffle(fs)
    for f in fs:
        start = time()
        result = f(1000)
        end = time()
        times[f].append(end - start)
        del result

# Results
for f in sorted(fs, key=stats):
    print(stats(f), f.__name__)

Most time/space efficient way to check if all elements in list of integers are 0

This is an implementation question for Python 2.7
Say I have a list of integers called nums, and I need to check if all values in nums are equal to zero. nums contains many elements (i.e. more than 10000), with many repeating values.
Using all():
if all(n == 0 for n in set(nums)): # I assume this conversion from list to set helps?
# do something
Using set subtraction:
if set(nums) - {0} == set([]):
# do something
Edit: better way to do the above approach, courtesy of user U9-Forward
if set(nums) == {0}:
# do something
How do the time and space complexities compare for each of these approaches? Is there a more efficient way to check this?
Note: for this case, I am trying to avoid using numpy/pandas.
Any set conversion of nums won't help as it will iterate the entire list:
if all(n == 0 for n in nums):
# ...
is just fine as it stops at the first non-zero element, disregarding the remainder.
Asymptotically, all these approaches are linear with random data.
Implementation details (no repeated function calls on the generator) make not any(nums) even faster, but that relies on 0 being the only falsy element present, i.e. there being no other falsy elements such as '' or None.
not any(nums) is probably the fastest because it will stop when/if it finds any non-zero element.
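A quick sketch of that caveat, in case nums can contain falsy values other than 0:
>>> nums = [0, '', None, 0.0]
>>> not any(nums)
True
>>> all(n == 0 for n in nums)
False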
Performance comparison:
a = range(10000)
b = [0] * 10000
%timeit not any(a) # 72 ns, fastest for non-zero lists
%timeit not any(b) # 33 ns, fastest for zero lists
%timeit all(n == 0 for n in a) # 365 ns
%timeit all(n == 0 for n in b) # 350 µs
%timeit set(a)=={0} # 228 µs
%timeit set(b)=={0} # 58 µs
If you can use numpy, then (np.array(nums) == 0).all() should do it.
In addition to @schwobaseggl's answer, the second example could be even better:
if set(nums)=={0}:
# do something

Efficiently check if an element occurs at least n times in a list

How to best write a Python function (check_list) to efficiently test if an element (x) occurs at least n times in a list (l)?
My first thought was:
def check_list(l, x, n):
    return l.count(x) >= n
But this doesn't short-circuit once x has been found n times; it always traverses the entire list, i.e. it is always O(len(l)).
A simple approach that does short-circuit would be:
def check_list(l, x, n):
    count = 0
    for item in l:
        if item == x:
            count += 1
            if count == n:
                return True
    return False
I also have a more compact short-circuiting solution with a generator:
def check_list(l, x, n):
    gen = (1 for item in l if item == x)
    return all(next(gen, 0) for i in range(n))
Are there other good solutions? What is the most efficient approach?
Thank you
Instead of incurring the extra overhead of setting up a range object and using all, which has to test the truthiness of each item, you could use itertools.islice to advance the generator n-1 steps ahead and then return the next item in the slice if it exists, or a default False if not:
from itertools import islice
def check_list(lst, x, n):
    gen = (True for i in lst if i == x)
    return next(islice(gen, n-1, None), False)
Note that like list.count, itertools.islice also runs at C speed. And this has the extra advantage of handling iterables that are not lists.
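For example (a made-up generator just to illustrate that point), the same check_list works on an iterator, which list.count could not handle:
>>> nums = (n % 7 for n in range(100))   # a generator, not a list
>>> check_list(nums, 3, 10)              # 3 appears 14 times, well over 10
True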
Some timing:
In [1]: from itertools import islice
In [2]: from random import randrange
In [3]: lst = [randrange(1,10) for i in range(100000)]
In [5]: %%timeit # using list.index
....: check_list(lst, 5, 1000)
....:
1000 loops, best of 3: 736 µs per loop
In [7]: %%timeit # islice
....: check_list(lst, 5, 1000)
....:
1000 loops, best of 3: 662 µs per loop
In [9]: %%timeit # using list.index
....: check_list(lst, 5, 10000)
....:
100 loops, best of 3: 7.6 ms per loop
In [11]: %%timeit # islice
....: check_list(lst, 5, 10000)
....:
100 loops, best of 3: 6.7 ms per loop
You could use the second argument of index to find the subsequent indices of occurrences:
def check_list(l, x, n):
    i = 0
    try:
        for _ in range(n):
            i = l.index(x, i) + 1
        return True
    except ValueError:
        return False
print( check_list([1,3,2,3,4,0,8,3,7,3,1,1,0], 3, 4) )
About index arguments
The official documentation does not mention the method's second or third argument in its Python Tutorial, section 5, but you can find them in the more comprehensive Python Standard Library reference, section 4.6:
s.index(x[, i[, j]]) index of the first occurrence of x in s (at or after index i and before index j) (8)
(8) index raises ValueError when x is not found in s. When supported, the additional arguments to the index method allow efficient searching of subsections of the sequence. Passing the extra arguments is roughly equivalent to using s[i:j].index(x), only without copying any data and with the returned index being relative to the start of the sequence rather than the start of the slice.
Performance Comparison
In comparing this list.index method with the islice(gen) method, the most important factor is the distance between the occurrences to be found. Once that distance is on average 13 or more, list.index performs better. For lower distances, the fastest method also depends on the number of occurrences to find: the more occurrences to find, the sooner the islice(gen) method outperforms list.index in terms of average distance; this gain fades out when the number of occurrences becomes really large.
The following graph draws the (approximate) border line, at which both methods perform equally well (the X-axis is logarithmic):
Ultimately, short-circuiting is the way to go if you expect a significant number of cases to lead to early termination. Let's explore the possibilities:
Take the case of the list.index method versus the list.count method (these were the two fastest according to my testing, although ymmv).
For list.index: if the list contains n or more occurrences of x, the method is called n times. Within list.index itself execution is very fast, allowing much faster iteration than the custom generator. If the occurrences of x are far enough apart, a large speedup will be seen from the lower-level execution of index. If instances of x are close together (shorter list / more common x's), much more of the time will be spent executing the slower Python code that mediates the rest of the function (looping over n and incrementing i).
The benefit of list.count is that it does all of the heavy lifting outside of slow Python execution. It is a much easier function to analyse, as it is simply O(n) in the list length. By spending almost none of the time in the Python interpreter, however, it is almost guaranteed to be faster for short lists. (A rough timing sketch follows the summary below.)
Summary of selection criteria:
shorter lists favor list.count
lists of any length that don't have a high probability to short circuit favor list.count
lists that are long and likely to short circuit favor list.index
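A rough sketch of such a comparison (my own example; the exact crossover depends on the list length, how common x is, and how large n is, so treat any numbers as illustrative only):
from random import randrange
from timeit import timeit

lst = [randrange(10) for _ in range(100000)]   # x = 5 occurs roughly 10000 times

def check_count(l, x, n):
    return l.count(x) >= n          # always scans the whole list, entirely in C

def check_index(l, x, n):
    i = 0
    try:
        for _ in range(n):          # n round trips through Python, each search in C
            i = l.index(x, i) + 1
        return True
    except ValueError:
        return False

print(timeit(lambda: check_count(lst, 5, 100), number=100))
print(timeit(lambda: check_index(lst, 5, 100), number=100))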
I would recommend using Counter from the collections module.
from collections import Counter
import numpy as np  # needed for the random test data below

%%time
[k for k, v in Counter(np.random.randint(0, 10000, 10000000)).items() if v > 1100]
#Output:
Wall time: 2.83 s
[1848, 1996, 2461, 4481, 4522, 5844, 7362, 7892, 9671, 9705]
This shows another way of doing it.
Sort the list.
Find the index of the first occurrence of the item.
Increase the index by one less than the number of times the item must occur. (n - 1)
Find if the element at that index is the same as the item you want to find.
def check_list(l, x, n):
    _l = sorted(l)
    try:
        index_1 = _l.index(x)
        return _l[index_1 + n - 1] == x
    except (ValueError, IndexError):  # x not present at all, or fewer than n items after it
        return False
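Quick check of the sorted-list approach above, using the same example list as earlier in the thread (the second call relies on ValueError also being caught when x is absent):
>>> check_list([1, 3, 2, 3, 4, 0, 8, 3, 7, 3, 1, 1, 0], 3, 4)
True
>>> check_list([1, 3, 2, 3, 4, 0, 8, 3, 7, 3, 1, 1, 0], 5, 1)
False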
c = 0
for i in l:
    if i == k:
        c += 1
if c >= n:
    print("true")
else:
    print("false")
Another possibility might be:
def check_list(l, x, n):
    return sum([1 for i in l if i == x]) >= n

Conditional counting in Python

not sure this was asked before, but I couldn't find an obvious answer. I'm trying to count the number of elements in a list that are equal to a certain value. The problem is that these elements are not of a built-in type. So if I have
class A:
    def __init__(self, a, b):
        self.a = a
        self.b = b

stuff = []
for i in range(1, 10):
    stuff.append(A(i/2, i%2))
Now I would like a count of the list elements whose field b = 1. I came up with two solutions:
print [e.b for e in stuff].count(1)
and
print len([e for e in stuff if e.b == 1])
Which is the best method? Is there a better alternative? It seems that the count() method does not accept keys (at least in Python version 2.5.1).
Many thanks!
sum(x.b == 1 for x in L)
A boolean (as resulting from comparisons such as x.b == 1) is also an int, with a value of 0 for False, 1 for True, so arithmetic such as summation works just fine.
This is the simplest code, but perhaps not the speediest (only timeit can tell you for sure;-). Consider (simplified case to fit well on command lines, but equivalent):
$ py26 -mtimeit -s'L=[1,2,1,3,1]*100' 'len([x for x in L if x==1])'
10000 loops, best of 3: 56.6 usec per loop
$ py26 -mtimeit -s'L=[1,2,1,3,1]*100' 'sum(x==1 for x in L)'
10000 loops, best of 3: 87.7 usec per loop
So, for this case, the "memory wasteful" approach of generating an extra temporary list and checking its length is actually solidly faster than the simpler, shorter, memory-thrifty one I tend to prefer. Other mixes of list values, Python implementations, availability of memory to "invest" in this speedup, etc, can affect the exact performance, of course.
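A tiny sketch tying this back to the question's class (assuming the A class and the stuff list defined in the question above, which contains five elements with b == 1):
>>> sum(e.b == 1 for e in stuff)
5
>>> len([e for e in stuff if e.b == 1])
5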
print sum(1 for e in L if e.b == 1)
I would prefer the second one as it's only looping over the list once.
If you use count() you're looping over the list once to get the b values, and then looping over it again to see how many of them equal 1.
A neat way may be to use reduce():
reduce(lambda total, e: total + (1 if e.b == 1 else 0), stuff, 0)
The documentation tells us that reduce() will:
Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
So we define a lambda that adds one the accumulated value only if the list item's b attribute is 1.
To hide reduce details, you may define a count function:
def count(condition, stuff):
    return reduce(lambda s, x: s + (1 if condition(x) else 0), stuff, 0)
Then you may use it by providing the condition for counting:
n = count(lambda i: i.b, stuff)
Given the input
name = ['ball', 'jeans', 'ball', 'ball', 'ball', 'jeans']
price = [1, 4, 1, 1, 1, 4]
weight = [2, 2, 2, 3, 2, 2]
First create a defaultdict to record the occurrence
from collections import defaultdict
occurrences = defaultdict(int)
Increment the count
for n, p, w in zip(name, price, weight):
    occurrences[(n, p, w)] += 1
Finally count the ones that appear more than once (True will yield 1)
print(sum(cnt > 1 for cnt in occurrences.values()))

Getting every odd variable in a list?

If I make a list in Python and want to write a function that would return only odd numbers from a range 1 to x, how would I do that?
For example, if I have the list [1, 2, 3, 4] from 1 to 4 (4 is my x), I want to return [1, 3].
If you want to start with an arbitrary list:
[item for item in yourlist if item % 2]
but if you're always starting with range, range(1, x, 2) is better!-)
For example:
$ python -mtimeit -s'x=99' 'filter(lambda(t): t % 2 == 1, range(1, x))'
10000 loops, best of 3: 38.5 usec per loop
$ python -mtimeit -s'x=99' 'range(1, x, 2)'
1000000 loops, best of 3: 1.38 usec per loop
so the right approach is about 28 times (!) faster than a somewhat-typical wrong one, in this case.
The "more general than you need if that's all you need" solution:
$ python -mtimeit -s'yourlist=range(1,99)' '[item for item in yourlist if item % 2]'
10000 loops, best of 3: 21.6 usec per loop
is only about twice as fast as the sample wrong one, but still over 15 times slower than the "just right" one!-)
What's wrong with:
def getodds(lst):
    return lst[1::2]
....???
(Assuming you want every other element from some arbitrary sequence ... all those which have odd indexes).
Alternatively if you want all items from a list of numbers where the value of that element is odd:
def oddonly(lst):
    return [x for x in lst if x % 2]
[Update: 2017]
You could use "lazy evaluation" to yield these from generators:
def get_elements_at_odd_indices(sequence):
    for index, item in enumerate(sequence):
        if index % 2:
            yield item
        else:
            continue
For getting odd elements (rather than elements at each odd offset from the start of the sequence) you could use the even simpler:
def get_odd_elements(sequence):
    for item in sequence:
        if item % 2:
            yield item
        else:
            continue
This should work for any sequence or iterable object types. (Obviously the latter only works for those sequences or iterables which yield numbers ... or other types for which % 2 evaluates to a meaningfully "odd" result).
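A small usage sketch (my own example values, chosen so the two generators give different results):
>>> data = [3, 8, 5, 6]
>>> list(get_elements_at_odd_indices(data))
[8, 6]
>>> list(get_odd_elements(data))
[3, 5]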
Also note that, if we want to operate efficiently on Pandas series or dataframe columns, or on the underlying NumPy arrays, we could get the elements at odd indexes using the [1::2] slice notation, and we can get each of the elements containing odd values using NumPy's "fancy indexing"
For example:
import numpy as np
arr = np.arange(1000)
odds = arr[arr%2!=0]
I show the "fancy index" as arr[arr%2!=0] because that will generalize better to filtering out every third, fourth or other nth element; but you can use much more elaborate expressions.
Note that the syntax arr[arr%2!=0] may look a bit odd. It's magic in the way that NumPy over-rides various arithmetic and bitwise operators and augmented assignment operations. The point is that NumPy evaluates such operations into machine code which can be efficiently vectorized over NumPy arrays ... using SIMD wherever the underlying CPU supports. For example on typical laptop and desktop systems today NumPy can evaluate many arithmetic operations into SSE operations.
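For instance, the same pattern extends to other filters (a hedged sketch; the variable names are just illustrative):
import numpy as np

arr = np.arange(20)
not_multiples_of_3 = arr[arr % 3 != 0]   # boolean mask keeps values not divisible by 3
every_third_item = arr[2::3]             # positional slicing: every third element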
To have a range of odd/even numbers up to and possibly including a number n, you can:
def odd_numbers(n):
    return range(1, n+1, 2)

def even_numbers(n):
    return range(0, n+1, 2)
If you want a generic algorithm that will take the items with odd indexes from a sequence, you can do the following:
import itertools
def odd_indexes(sequence):
    return itertools.islice(sequence, 1, None, 2)

def even_indexes(sequence):
    return itertools.islice(sequence, 0, None, 2)
