Not sure this was asked before, but I couldn't find an obvious answer. I'm trying to count the number of elements in a list that are equal to a certain value. The problem is that these elements are not of a built-in type. So if I have
class A:
    def __init__(self, a, b):
        self.a = a
        self.b = b

stuff = []
for i in range(1, 10):
    stuff.append(A(i/2, i%2))
Now I would like a count of the list elements whose field b = 1. I came up with two solutions:
print [e.b for e in stuff].count(1)
and
print len([e for e in stuff if e.b == 1])
Which is the best method? Is there a better alternative? It seems that the count() method does not accept keys (at least in Python 2.5.1).
Many thanks!
sum(x.b == 1 for x in L)
A boolean (as resulting from comparisons such as x.b == 1) is also an int, with a value of 0 for False, 1 for True, so arithmetic such as summation works just fine.
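A quick interactive check makes this concrete (using the stuff list from the question, where b is 1 for the five odd values of i):
>>> True == 1
True
>>> sum(e.b == 1 for e in stuff)
5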
This is the simplest code, but perhaps not the speediest (only timeit can tell you for sure;-). Consider (simplified case to fit well on command lines, but equivalent):
$ py26 -mtimeit -s'L=[1,2,1,3,1]*100' 'len([x for x in L if x==1])'
10000 loops, best of 3: 56.6 usec per loop
$ py26 -mtimeit -s'L=[1,2,1,3,1]*100' 'sum(x==1 for x in L)'
10000 loops, best of 3: 87.7 usec per loop
So, for this case, the "memory wasteful" approach of generating an extra temporary list and checking its length is actually solidly faster than the simpler, shorter, memory-thrifty one I tend to prefer. Other mixes of list values, Python implementations, availability of memory to "invest" in this speedup, etc, can affect the exact performance, of course.
print sum(1 for e in L if e.b == 1)
I would prefer the second one as it's only looping over the list once.
If you use count() you're looping over the list once to get the b values, and then looping over it again to see how many of them equal 1.
A neat way may be to use reduce():
reduce(lambda x, y: x + (1 if y.b == 1 else 0), stuff, 0)
The documentation tells us that reduce() will:
Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
So we define a lambda that adds one to the accumulated value only if the list item's b attribute is 1.
To hide reduce details, you may define a count function:
def count(condition, stuff):
    return reduce(lambda s, x: s + (1 if condition(x) else 0), stuff, 0)
(In Python 3 you would first need from functools import reduce.)
Then you may use it by providing the condition for counting:
n = count(lambda i: i.b == 1, stuff)
Given the input
name = ['ball', 'jeans', 'ball', 'ball', 'ball', 'jeans']
price = [1, 4, 1, 1, 1, 4]
weight = [2, 2, 2, 3, 2, 2]
First create a defaultdict to record the occurrences
from collections import defaultdict
occurrences = defaultdict(int)
Increment the count
for n, p, w in zip(name, price, weight):
    occurrences[(n, p, w)] += 1
Finally count the ones that appear more than once (True will yield 1)
print(sum(cnt > 1 for cnt in occurrences.values()))
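As a side note, collections.Counter can build the same mapping in one step; a minimal sketch, equivalent to the defaultdict loop above:
from collections import Counter

occurrences = Counter(zip(name, price, weight))
print(sum(cnt > 1 for cnt in occurrences.values()))  # prints 2: ('ball', 1, 2) and ('jeans', 4, 2) repeat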
This is an implementation question for Python 2.7
Say I have a list of integers called nums, and I need to check if all values in nums are equal to zero. nums contains many elements (i.e. more than 10000), with many repeating values.
Using all():
if all(n == 0 for n in set(nums)):  # I assume this conversion from list to set helps?
    # do something
Using set subtraction:
if set(nums) - {0} == set([]):
    # do something
Edit: better way to do the above approach, courtesy of user U9-Forward
if set(nums) == {0}:
    # do something
How do the time and space complexities compare for each of these approaches? Is there a more efficient way to check this?
Note: for this case, I am trying to avoid using numpy/pandas.
Any set conversion of nums won't help as it will iterate the entire list:
if all(n == 0 for n in nums):
    # ...
is just fine as it stops at the first non-zero element, disregarding the remainder.
Asymptotically, all these approaches are linear with random data.
Implementation details (no repeated function calls on the generator) make not any(nums) even faster, but that relies on the absence of falsy elements other than 0, e.g. '' or None.
not any(nums) is probably the fastest because it will stop when/if it finds any non-zero element.
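A small example of that caveat, since '' and None are falsy too:
>>> not any([0, 0, 0])
True
>>> not any([0, '', None])  # "passes" even though not all elements are 0
True
>>> all(n == 0 for n in [0, '', None])
False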
Performance comparison:
a = range(10000)
b = [0] * 10000
%timeit not any(a) # 72 ns, fastest for non-zero lists
%timeit not any(b) # 33 ns, fastest for zero lists
%timeit all(n == 0 for n in a) # 365 ns
%timeit all(n == 0 for n in b) # 350 µs
%timeit set(a)=={0} # 228 µs
%timeit set(b)=={0} # 58 µs
If you can use numpy, then (np.array(nums) == 0).all() should do it.
In addition to @schwobaseggl's answer, the second example could be even simpler:
if set(nums) == {0}:
    # do something
I found this one-line function on the Python wiki that creates a list of all the subsets (the powerset) that can be created from a list passed as an argument.
f = lambda x: [[y for j, y in enumerate(set(x)) if (i >> j) & 1] for i in range(2**len(set(x)))]
Can someone please explain how this function works?
To construct the powerset, iterating over range(2**len(set(x))) gives you every possible binary bit-mask over the set:
range(2**len(set(x)))  # in binary: 00000, 00001, 00010, ..., 11110, 11111
Now you just need to test if the bit is set in i to see if you need to include it in the set, e.g.:
>>> i = 0b10010
>>> [y for j, y in enumerate(range(5)) if (i >> j) & 1]
[1, 4]
Though I'm not sure how efficient it is given the call to set(x) for every iteration. There is a small hack that would avoid that:
f = lambda x: [[y for j, y in enumerate(s) if (i >> j) & 1] for s in [set(x)] for i in range(2**len(s))]
A couple of other forms using itertools:
import itertools as it
f1 = lambda x: [list(it.compress(s, i)) for s in [set(x)] for i in it.product((0,1), repeat=len(s))]
f2 = lambda x: list(it.chain.from_iterable(it.combinations(set(x), r) for r in range(len(set(x))+1)))
Note: this last one could just return an iterable instead of a list if you remove the list() call; depending on the use case, this could save some memory.
Looking at some timings of a list of 25 random numbers 0-50:
%%timeit binary: 1 loop, best of 3: 20.1 s per loop
%%timeit binary+hack: 1 loop, best of 3: 17.9 s per loop
%%timeit compress/product: 1 loop, best of 3: 5.27 s per loop
%%timeit chain/combinations: 1 loop, best of 3: 659 ms per loop
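For reference, f2 is essentially the powerset recipe from the itertools documentation, just without the wrapping list():
from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))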
Let's rewrite it a bit and break it down step by step:
f = lambda x: [[y for j, y in enumerate(set(x)) if (i >> j) & 1] for i in range(2**len(set(x)))]
is equivalent to:
def f(x):
    n = len(set(x))
    sets = []
    for i in range(2**n):  # all binary combinations of members of the set
        set_i = []
        for j, y in enumerate(set(x)):
            if (i >> j) & 1:  # check if bit nr j is set
                set_i.append(y)
        sets.append(set_i)
    return sets
For an input list like [1,2,3,4], the following happens:
n=4
range(2**n)=[0,1,2,3...15]
which, in binary is:
0,1,10,11,100...1110,1111
enumerate pairs each y with its index, so in our case:
[(0,1),(1,2),(2,3),(3,4)]
The (i>>j) & 1 part might require some explanation.
(i>>j) shifts the number i j places to the right, e.g. in decimal: 4>>2=1, or in binary: 100>>2=001. The & is the bit-wise and operator. It checks, for every bit position of both operands, whether both bits are 1, and returns the result as a number, acting like a filter: 10111 & 11001 = 10001.
In the case of our example, it checks if the bit at place j is 1. If it is, the corresponding value is added to the result list. This way the binary map of combinations is converted to a list of lists, which is returned.
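Putting it all together on a small input (for these small ints, CPython happens to iterate the set in numeric order, though set ordering is not guaranteed in general):
>>> f = lambda x: [[y for j, y in enumerate(set(x)) if (i >> j) & 1] for i in range(2**len(set(x)))]
>>> f([1, 2, 3])
[[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]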
How to best write a Python function (check_list) to efficiently test if an element (x) occurs at least n times in a list (l)?
My first thought was:
def check_list(l, x, n):
    return l.count(x) >= n
But this doesn't short-circuit once x has been found n times, and it always scans the entire list.
A simple approach that does short-circuit would be:
def check_list(l, x, n):
    count = 0
    for item in l:
        if item == x:
            count += 1
            if count == n:
                return True
    return False
I also have a more compact short-circuiting solution with a generator:
def check_list(l, x, n):
    gen = (1 for item in l if item == x)
    return all(next(gen, 0) for i in range(n))
Are there other good solutions? What is the best efficient approach?
Thank you
Instead of incurring extra overhead with the setup of a range object and using all, which has to test the truthiness of each item, you could use itertools.islice to advance the generator n steps ahead, and then return the next item in the slice if the slice exists, or a default False if not:
from itertools import islice
def check_list(lst, x, n):
    gen = (True for i in lst if i == x)
    return next(islice(gen, n-1, None), False)
Note that like list.count, itertools.islice also runs at C speed. And this has the extra advantage of handling iterables that are not lists.
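For example, it works unchanged on a generator, where list.count is not available:
>>> check_list(iter([1, 2, 1, 3, 1]), 1, 3)
True
>>> check_list((c for c in 'abcabc'), 'c', 3)
False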
Some timing:
In [1]: from itertools import islice
In [2]: from random import randrange
In [3]: lst = [randrange(1,10) for i in range(100000)]
In [5]: %%timeit # using list.index
....: check_list(lst, 5, 1000)
....:
1000 loops, best of 3: 736 µs per loop
In [7]: %%timeit # islice
....: check_list(lst, 5, 1000)
....:
1000 loops, best of 3: 662 µs per loop
In [9]: %%timeit # using list.index
....: check_list(lst, 5, 10000)
....:
100 loops, best of 3: 7.6 ms per loop
In [11]: %%timeit # islice
....: check_list(lst, 5, 10000)
....:
100 loops, best of 3: 6.7 ms per loop
You could use the second argument of index to find the subsequent indices of occurrences:
def check_list(l, x, n):
    i = 0
    try:
        for _ in range(n):
            i = l.index(x, i) + 1
        return True
    except ValueError:
        return False
print( check_list([1,3,2,3,4,0,8,3,7,3,1,1,0], 3, 4) )
About index arguments
The official documentation does not mention the method's second or third argument in its Python Tutorial, section 5, but you can find them in the more comprehensive Python Standard Library reference, section 4.6:
s.index(x[, i[, j]]) index of the first occurrence of x in s (at or after index i and before index j) (8)
(8) index raises ValueError when x is not found in s. When supported, the additional arguments to the index method allow efficient searching of subsections of the sequence. Passing the extra arguments is roughly equivalent to using s[i:j].index(x), only without copying any data and with the returned index being relative to the start of the sequence rather than the start of the slice.
Performance Comparison
In comparing this list.index method with the islice(gen) method, the most important factor is the distance between the occurrences to be found. Once that distance is on average 13 or more, list.index performs better. For lower distances, the fastest method also depends on the number of occurrences to find: the more occurrences to find, the sooner the islice(gen) method outperforms list.index in terms of average distance; this gain fades out when the number of occurrences becomes really large.
A graph (not reproduced here) drew the approximate border line at which both methods perform equally well (the X-axis is logarithmic).
Ultimately, short-circuiting is the way to go if you expect a significant number of cases will lead to early termination. Let's explore the possibilities:
Take the case of the list.index method versus the list.count method (these were the two fastest according to my testing, although ymmv)
With list.index, if the list contains n or more occurrences of x, the method is called n times. Within each list.index call, execution is very fast, allowing much faster iteration than a custom generator. If the occurrences of x are far enough apart, a large speedup will be seen from the lower-level execution of index. If instances of x are close together (a shorter list / more common x), much more of the time will be spent executing the slower Python code that mediates the rest of the function (looping over n and incrementing i).
The benefit of list.count is that it does all of the heavy lifting outside of slow Python execution. It is a much easier function to analyse, as it is simply a case of O(n) time complexity. By spending almost none of the time in the Python interpreter, however, it is almost guaranteed to be faster for short lists.
Summary of selection criteria:
shorter lists favor list.count
lists of any length that are unlikely to short-circuit favor list.count
lists that are long and likely to short-circuit favor list.index (see the sketch below)
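If you wanted to encode those criteria directly, a rough dispatcher might look like the following sketch; the length cutoff is a made-up knob you would have to calibrate with timeit on your own data:
def check_list(l, x, n, cutoff=1000):  # cutoff is hypothetical; tune with timeit
    if len(l) < cutoff:
        return l.count(x) >= n  # single C-speed pass wins on short lists
    i = 0
    try:
        for _ in range(n):  # short-circuits via repeated C-speed index calls
            i = l.index(x, i) + 1
        return True
    except ValueError:
        return False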
I would recommend using Counter from the collections module.
from collections import Counter
import numpy as np

%%time
[k for k, v in Counter(np.random.randint(0, 10000, 10000000)).items() if v > 1100]
#Output:
Wall time: 2.83 s
[1848, 1996, 2461, 4481, 4522, 5844, 7362, 7892, 9671, 9705]
This shows another way of doing it.
Sort the list.
Find the index of the first occurrence of the item.
Increase the index by one less than the number of times the item must occur. (n - 1)
Find if the element at that index is the same as the item you want to find.
def check_list(l, x, n):
    _l = sorted(l)
    try:
        index_1 = _l.index(x)
        return _l[index_1 + n - 1] == x
    except (ValueError, IndexError):  # ValueError: x not in the list at all
        return False
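Since the list is sorted anyway, the O(n) .index scan could be replaced with an O(log n) binary search; a small variation on the same idea using the standard bisect module:
import bisect

def check_list(l, x, n):
    _l = sorted(l)
    i = bisect.bisect_left(_l, x)  # leftmost position where x could sit
    # if the element n - 1 places further along is still x, x occurs at least n times
    return i + n - 1 < len(_l) and _l[i + n - 1] == x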
def check_list(l, k, n):
    c = 0
    for i in l:
        if i == k:
            c += 1
    if c >= n:
        print("true")
    else:
        print("false")
Another possibility might be:
def check_list(l, x, n):
    return sum([1 for i in l if i == x]) >= n
I'm currently working on a high-performance Python 2.7 project utilizing lists tens of thousands of elements in size. Obviously, every operation must be performed as fast as possible.
So, I have two lists: one of them is a list of unique arbitrary numbers, let's call it A, and the other one is a linear list starting with 1 and with the same length as the first list, named B, which represents indices in A (starting with 1).
Something like enumerate, starting with 1.
For example:
A = [500, 300, 400, 200, 100] # The order here is arbitrary, they can be any integers, but every integer can only exist once
B = [ 1, 2, 3, 4, 5] # This is fixed, starting from 1, with exactly as many elements as A
If I have an element of B (called e_B) and want the corresponding element in A, I can simply do correspond_e_A = A[e_B - 1]. No problem.
But now I have a huge list of random, non-unique integers, and I want to know the indices of the integers that are in A, and what the corresponding elements in B are.
I think I have a reasonable solution for the first question:
indices_of_existing = numpy.nonzero(numpy.in1d(random_list, A))[0]
What is great about this approach is that there is no need to map() single operations; numpy's in1d just returns an array like [True, True, False, True, ...]. Using nonzero() I can get the indices of the elements in random_list that exist in A. Perfect, I think.
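A small illustration with made-up queries (this random_list is hypothetical example data):
>>> import numpy
>>> A = [500, 300, 400, 200, 100]
>>> random_list = [200, 999, 100, 300, 7]
>>> numpy.in1d(random_list, A)
array([ True, False,  True,  True, False])
>>> numpy.nonzero(numpy.in1d(random_list, A))[0]
array([0, 2, 3])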
But for the second question, I'm stumped.
I tried something like:
corresponding_e_B = map(lambda x: numpy.where(A==x)[0][0] + 1, random_list)
This correctly gives me the indices, but it's not optimal, because firstly I need a map(), secondly I need a lambda, and finally numpy.where() does not stop after the item was found once (remember, A has only unique elements), meaning that it scales horribly with huge datasets like mine.
I took a look at bisect, but it seems bisect only works with single requests, not with lists, meaning that I'd still have to use map() and build my list elementwise (that's slow, isn't it?)
Since I'm quite new to Python, I was hoping anyone here might have an idea? Maybe a library I don't know yet?
I think you would be well advised to use a hashtable for your lookups instead of numpy.in1d, which uses a O(n log n) merge sort as a preprocessing step.
>>> A = [500, 300, 400, 200, 100]
>>> index = { k:i for i,k in enumerate(A, 1) }
>>> random_list = [200, 100, 50]
>>> [i for i,x in enumerate(random_list) if x in index]
[0, 1]
>>> map(index.get, random_list)
[4, 5, None]
>>> filter(None, map(index.get, random_list))
[4, 5]
This is Python 2, in Python 3 map and filter return generator-like objects, so you would need a list around filter if you want to get the result as a list.
I have tried to use builtin functions as much as possible to push the computational burden to the C side (assuming you use CPython). All the names are resolved upfront, so it should be pretty fast.
In general, for maximum performance, you might want to consider using PyPy, a great alternative Python implementation with JIT compilation.
A benchmark to compare multiple approaches is never a bad idea:
import sys
is_pypy = '__pypy__' in sys.builtin_module_names
import timeit
import random
if not is_pypy:
    import numpy
import bisect

n = 10000
m = 10000
q = 100

A = set()
while len(A) < n: A.add(random.randint(0, 2*n))
A = list(A)

queries = set()
while len(queries) < m: queries.add(random.randint(0, 2*n))
queries = list(queries)

# these two solve question one (find indices of queries that exist in A)
if not is_pypy:
    def fun11():
        for _ in range(q):
            numpy.nonzero(numpy.in1d(queries, A))[0]

def fun12():
    index = set(A)
    for _ in range(q):
        [i for i, x in enumerate(queries) if x in index]

# these three solve question two (find according entries of B)
def fun21():
    index = { k: i for i, k in enumerate(A, 1) }
    for _ in range(q):
        [index[i] for i in queries if i in index]

def fun22():
    index = { k: i for i, k in enumerate(A, 1) }
    for _ in range(q):
        list(filter(None, map(index.get, queries)))

def findit(keys, values, key):
    i = bisect.bisect_left(keys, key)  # bisect_left: i points at key if present
    if i == len(keys) or keys[i] != key:
        return None
    return values[i]

def fun23():
    keys, values = zip(*sorted((k, i) for i, k in enumerate(A, 1)))
    for _ in range(q):
        list(filter(None, [findit(keys, values, x) for x in queries]))

if not is_pypy:
    # note this does not filter out nonexisting elements
    def fun24():
        I = numpy.argsort(A)
        ss = numpy.searchsorted
        maxi = len(I)
        for _ in range(q):
            a = ss(A, queries, sorter=I)
            I[a[a < maxi]]

tests = ("fun12", "fun21", "fun22", "fun23")
if not is_pypy: tests = ("fun11",) + tests + ("fun24",)

if is_pypy:
    # warmup
    for f in tests:
        timeit.timeit("%s()" % f, setup="from __main__ import %s" % f, number=20)

# actual timing
for f in tests:
    print("%s: %.3f" % (f, timeit.timeit("%s()" % f, setup="from __main__ import %s" % f, number=3)))
Results:
$ python2 -V
Python 2.7.6
$ python3 -V
Python 3.3.3
$ pypy -V
Python 2.7.3 (87aa9de10f9ca71da9ab4a3d53e0ba176b67d086, Dec 04 2013, 12:50:47)
[PyPy 2.2.1 with GCC 4.8.2]
$ python2 test.py
fun11: 1.016
fun12: 0.349
fun21: 0.302
fun22: 0.276
fun23: 2.432
fun24: 0.897
$ python3 test.py
fun11: 0.973
fun12: 0.382
fun21: 0.423
fun22: 0.341
fun23: 3.650
fun24: 0.894
$ pypy ~/tmp/test.py
fun12: 0.087
fun21: 0.073
fun22: 0.128
fun23: 1.131
You can tweak n (size of A), m (size of random_list) and q (number of queries) to your scenario. To my surprise, my attempt to be clever and use builtin functions instead of list comps has not paid off, since fun22 is not a lot faster than fun21 (only ~10% in Python 2 and ~25% in Python 3, but almost 75% slower in PyPy). A case of premature optimization here. I guess the difference is due to the fact that fun22 builds up an unnecessary temporary list per query in Python 2. We also see that binary search is pretty bad (look at fun23).
import numpy as np

def numpy_optimized(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    return I[Q]
This calculates the same thing as the OP, but with the indices in matching order to the values queried, which I imagine is an improvement in functionality. It is up to twice as fast as the OP's solution on my machine, which puts it slightly ahead of the non-pypy solutions, if I interpret your benchmarks correctly.
Or in case we cannot assume all index are present in values, and would like missing queries to be silently dropped:
def numpy_optimized_filtered(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    Z = I[Q]
    return Z[values[Z] == index]
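For example, with the A from the question as values (note the returned positions are zero-based, so add 1 to map into B; also, queries larger than values.max() would need an extra guard, since searchsorted can return len(values)):
>>> values = np.array([500, 300, 400, 200, 100])
>>> index = np.array([200, 250, 100])  # 250 does not occur and is dropped
>>> numpy_optimized_filtered(index, values)
array([3, 4])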
If I make a list in Python and want to write a function that would return only odd numbers from a range 1 to x how would I do that?
For example, if I have list [1, 2, 3, 4] from 1 to 4 (4 is my x), I want to return [1, 3].
If you want to start with an arbitrary list:
[item for item in yourlist if item % 2]
but if you're always starting with range, range(1, x, 2) is better!-)
For example:
$ python -mtimeit -s'x=99' 'filter(lambda(t): t % 2 == 1, range(1, x))'
10000 loops, best of 3: 38.5 usec per loop
$ python -mtimeit -s'x=99' 'range(1, x, 2)'
1000000 loops, best of 3: 1.38 usec per loop
so the right approach is about 28 times (!) faster than a somewhat-typical wrong one, in this case.
The "more general than you need if that's all you need" solution:
$ python -mtimeit -s'yourlist=range(1,99)' '[item for item in yourlist if item % 2]'
10000 loops, best of 3: 21.6 usec per loop
is only about twice as fast as the sample wrong one, but still over 15 times slower than the "just right" one!-)
What's wrong with:
def getodds(lst):
return lst[1::2]
....???
(Assuming you want every other element from some arbitrary sequence ... all those which have odd indexes).
Alternatively if you want all items from a list of numbers where the value of that element is odd:
def oddonly(lst):
return [x for x in lst if x % 2]
[Update: 2017]
You could use "lazy evaluation" to yield these from generators:
def get_elements_at_odd_indices(sequence):
    for index, item in enumerate(sequence):
        if index % 2:
            yield item
        else:
            continue
For getting odd elements (rather than elements at each odd offset from the start of the sequence) you could use the even simpler:
def get_odd_elements(sequence):
    for item in sequence:
        if item % 2:
            yield item
        else:
            continue
This should work for any sequence or iterable object types. (Obviously the latter only works for those sequences or iterables which yield numbers ... or other types for which % 2 evaluates to a meaningfully "odd" result).
Also note that, if we want to efficiently operate on Pandas series or dataframe columns, or on the underlying NumPy arrays, we could get the elements at odd indexes using the [1::2] slice notation, and we can get each of the elements containing odd values using NumPy's "fancy indexing".
For example:
import numpy as np

arr = np.arange(1000)
odds = arr[arr % 2 != 0]
I show the "fancy index" as arr[arr%2!=0] because that will generalize better to filtering out every third, fourth or other nth element; but you can use much more elaborate expressions.
Note that the syntax arr[arr%2!=0] may look a bit odd. It's magic in the way that NumPy overrides various arithmetic and bitwise operators and augmented assignment operations. The point is that NumPy evaluates such operations into machine code which can be efficiently vectorized over NumPy arrays, using SIMD wherever the underlying CPU supports it. For example, on typical laptop and desktop systems today, NumPy can evaluate many arithmetic operations into SSE operations.
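For example, the same fancy-indexing pattern filters on any predicate, not just parity (using the conventional np alias):
import numpy as np

arr = np.arange(20)
print(arr[arr % 3 == 0])                 # multiples of 3: [ 0  3  6  9 12 15 18]
print(arr[(arr % 2 != 0) & (arr < 10)])  # odd values below 10: [1 3 5 7 9]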
To have a range of odd/even numbers up to and possibly including a number n, you can:
def odd_numbers(n):
    return range(1, n+1, 2)

def even_numbers(n):
    return range(0, n+1, 2)
If you want a generic algorithm that will take the items with odd indexes from a sequence, you can do the following:
import itertools

def odd_indexes(sequence):
    return itertools.islice(sequence, 1, None, 2)

def even_indexes(sequence):
    return itertools.islice(sequence, 0, None, 2)