If I make a list in Python and want to write a function that would return only odd numbers from a range 1 to x how would I do that?
For example, if I have list [1, 2, 3, 4] from 1 to 4 (4 is my x), I want to return [1, 3].
If you want to start with an arbitrary list:
[item for item in yourlist if item % 2]
but if you're always starting with range, range(1, x, 2) is better!-)
For example:
$ python -mtimeit -s'x=99' 'filter(lambda t: t % 2 == 1, range(1, x))'
10000 loops, best of 3: 38.5 usec per loop
$ python -mtimeit -s'x=99' 'range(1, x, 2)'
1000000 loops, best of 3: 1.38 usec per loop
so the right approach is about 28 times (!) faster than a somewhat-typical wrong one, in this case.
The "more general than you need if that's all you need" solution:
$ python -mtimeit -s'yourlist=range(1,99)' '[item for item in yourlist if item % 2]'
10000 loops, best of 3: 21.6 usec per loop
is only about twice as fast as the sample wrong one, but still over 15 times slower than the "just right" one!-)
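Putting that together for the original question, a minimal sketch (note the x + 1 upper bound so that x itself is included when it is odd):
def odds_up_to(x):
    # hypothetical helper: the odd numbers from 1 through x inclusive
    return list(range(1, x + 1, 2))

odds_up_to(4)  # [1, 3]
odds_up_to(5)  # [1, 3, 5]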
What's wrong with:
def getodds(lst):
    return lst[1::2]
....???
(Assuming you want every other element from some arbitrary sequence ... all those which have odd indexes).
Alternatively if you want all items from a list of numbers where the value of that element is odd:
def oddonly(lst):
    return [x for x in lst if x % 2]
[Update: 2017]
You could use "lazy evaluation" to yield these from generators:
def get_elements_at_odd_indices(sequence):
    for index, item in enumerate(sequence):
        if index % 2:
            yield item
For getting odd elements (rather than elements at each odd offset from the start of the sequence) you could use the even simpler:
def get_odd_elements(sequence):
    for item in sequence:
        if item % 2:
            yield item
This should work for any sequence or iterable object types. (Obviously the latter only works for those sequences or iterables which yield numbers ... or other types for which % 2 evaluates to a meaningfully "odd" result).
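For example, a quick usage sketch:
>>> list(get_odd_elements(range(1, 10)))
[1, 3, 5, 7, 9]
>>> list(get_elements_at_odd_indices(['a', 'b', 'c', 'd']))
['b', 'd']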
Also note that, if we want to operate efficiently on Pandas series, dataframe columns, or the underlying NumPy arrays, we can get the elements at odd indexes using the same [1::2] slice notation, and we can get the elements containing odd values using NumPy's boolean "fancy indexing".
For example:
import numpy as np
arr = np.arange(1000)
odds = arr[arr % 2 != 0]
I show the "fancy index" as arr[arr % 2 != 0] because that generalizes better to filtering out every third, fourth, or other nth element; but you can use much more elaborate expressions.
Note that the syntax arr[arr % 2 != 0] may look a bit odd. It works because NumPy overrides the various arithmetic, comparison, and bitwise operators (and augmented assignments) to apply element-wise across whole arrays. The point is that NumPy dispatches such operations to precompiled vectorized loops, using SIMD wherever the underlying CPU supports it. For example, on typical laptop and desktop systems today, NumPy can execute many arithmetic operations as SSE instructions.
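As a quick sketch of how the mask generalizes, here is the same idea keeping only the values not divisible by 3:
>>> import numpy as np
>>> arr = np.arange(10)
>>> arr[arr % 3 != 0]
array([1, 2, 4, 5, 7, 8])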
To have a range of odd/even numbers up to and possibly including a number n, you can:
def odd_numbers(n):
    return range(1, n+1, 2)

def even_numbers(n):
    return range(0, n+1, 2)
If you want a generic algorithm that will take the items with odd indexes from a sequence, you can do the following:
import itertools

def odd_indexes(sequence):
    return itertools.islice(sequence, 1, None, 2)

def even_indexes(sequence):
    return itertools.islice(sequence, 0, None, 2)
Related
def multArray(A, k):
    A = [i * k for i in A]
    return A

# tests
tests = (([5,12,31,7,25],10),
         ([-5,12,-31,7,25],10),
         ([-5,12,-31,7,25],-1),
         ([-5,12,-31,7,25],0),
         ([],10),
         ([],-1))

# should print: [50,120,310,70,250],[-50,120,-310,70,250],[5,-12,31,-7,-25],[0,0,0,0,0],[],[]
for (A,k) in tests:
    multArray(A, k)
    print(A)
This is the solution I see on here from other questions but I can't seem to get it to work. Needs to be done without map or numpy.
Your example code creates a new list with the updated values; however, the test prints the original list. To multiply the original list by a factor you need to update the values in the original list, something like this:
def multArray(A, k):
    for i in range(len(A)):
        A[i] *= k
# tests
tests = (([5,12,31,7,25],10),
         ([-5,12,-31,7,25],10),
         ([-5,12,-31,7,25],-1),
         ([-5,12,-31,7,25],0),
         ([],10),
         ([],-1))

for (A,k) in tests:
    multArray(A, k)
    print(A)
Creating new list vs. updating existing list
It must be noted that it is much quicker to create a new list in a case like this. If we look at the performance of creating a new list versus updating the existing list:
python -m timeit -s "a = [i for i in range(100)]" "b = [i * 7 for i in a]"
executes in:
50000 loops, best of 5: 8.54 usec per loop
whereas:
python -m timeit -s "a = [i for i in range(100)]" "for i in range(len(a)): a[i] *= 7"
executes in:
5000 loops, best of 5: 40.6 usec per loop
So it is roughly 5x faster to create a new list. There is an argument that updating the existing list is more memory efficient, but certainly not for examples of this size.
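If you need to mutate the caller's list (as the question requires) but want the comprehension's speed, slice assignment gives you both; a minimal sketch:
def multArray(A, k):
    # build a new list quickly, then replace A's contents in place
    A[:] = [i * k for i in A]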
Relative performance of numpy array
It should also be noted that multiplying all the elements in a numpy array by a particular value is much faster than either method using lists.
python -m timeit -s "import numpy as np; a = np.array([i for i in range(100)])" "a = a * 7"
executes in:
500000 loops, best of 5: 839 nsec per loop
This is because the memory allocated for numpy arrays is contiguous, so operations benefit from a wide range of processor-level caching effects.
This is an implementation question for Python 2.7
Say I have a list of integers called nums, and I need to check if all values in nums are equal to zero. nums contains many elements (i.e. more than 10000), with many repeating values.
Using all():
if all(n == 0 for n in set(nums)):  # I assume this conversion from list to set helps?
    # do something
Using set subtraction:
if set(nums) - {0} == set([]):
    # do something
Edit: better way to do the above approach, courtesy of user U9-Forward
if set(nums) == {0}:
    # do something
How do the time and space complexities compare for each of these approaches? Is there a more efficient way to check this?
Note: for this case, I am trying to avoid using numpy/pandas.
Any set conversion of nums won't help as it will iterate the entire list:
if all(n == 0 for n in nums):
    # ...
is just fine as it stops at the first non-zero element, disregarding the remainder.
Asymptotically, all these approaches are linear with random data.
Implementation details (no repeated function calls on the generator) make not any(nums) even faster, but that relies on the absence of any falsy elements other than 0, e.g. '' or None.
not any(nums) is probably the fastest because it will stop when/if it finds any non-zero element.
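To see that caveat about other falsy values, a quick sketch:
>>> not any([0, 0, 0])
True
>>> not any([0, '', None])  # '' and None are falsy too, so this is a false positive
True
>>> all(n == 0 for n in [0, '', None])
False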
Performance comparison:
a = range(10000)
b = [0] * 10000
%timeit not any(a) # 72 ns, fastest for non-zero lists
%timeit not any(b) # 33 ns, fastest for zero lists
%timeit all(n == 0 for n in a) # 365 ns
%timeit all(n == 0 for n in b) # 350 µs
%timeit set(a) == {0} # 228 µs
%timeit set(b) == {0} # 58 µs
If you can use numpy, then (np.array(nums) == 0).all() should do it.
In addition to @schwobaseggl's answer, the second example could be even better:
if set(nums) == {0}:
    # do something
How to best write a Python function (check_list) to efficiently test if an element (x) occurs at least n times in a list (l)?
My first thought was:
def check_list(l, x, n):
    return l.count(x) >= n

But this doesn't short-circuit once x has been found n times, and it always scans the whole list, O(len(l)).
A simple approach that does short-circuit would be:
def check_list(l, x, n):
    count = 0
    for item in l:
        if item == x:
            count += 1
            if count == n:
                return True
    return False
I also have a more compact short-circuiting solution with a generator:
def check_list(l, x, n):
    gen = (1 for item in l if item == x)
    return all(next(gen, 0) for i in range(n))
Are there other good solutions? What is the best efficient approach?
Thank you
Instead of incurring extra overhead with the setup of a range object and using all which has to test the truthiness of each item, you could use itertools.islice to advance the generator n steps ahead, and then return the next item in the slice if the slice exists or a default False if not:
from itertools import islice

def check_list(lst, x, n):
    gen = (True for i in lst if i == x)
    return next(islice(gen, n-1, None), False)
Note that like list.count, itertools.islice also runs at C speed. And this has the extra advantage of handling iterables that are not lists.
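For example, a quick sketch with plain generators rather than lists:
>>> check_list(iter([1, 5, 2, 5, 5]), 5, 2)
True
>>> check_list((i % 3 for i in range(10)), 5, 1)
False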
Some timing:
In [1]: from itertools import islice
In [2]: from random import randrange
In [3]: lst = [randrange(1,10) for i in range(100000)]
In [5]: %%timeit # using list.index
....: check_list(lst, 5, 1000)
....:
1000 loops, best of 3: 736 µs per loop
In [7]: %%timeit # islice
....: check_list(lst, 5, 1000)
....:
1000 loops, best of 3: 662 µs per loop
In [9]: %%timeit # using list.index
....: check_list(lst, 5, 10000)
....:
100 loops, best of 3: 7.6 ms per loop
In [11]: %%timeit # islice
....: check_list(lst, 5, 10000)
....:
100 loops, best of 3: 6.7 ms per loop
You could use the second argument of index to find the subsequent indices of occurrences:
def check_list(l, x, n):
    i = 0
    try:
        for _ in range(n):
            i = l.index(x, i) + 1
        return True
    except ValueError:
        return False
print( check_list([1,3,2,3,4,0,8,3,7,3,1,1,0], 3, 4) )
About index arguments
The official documentation does not mention the method's second or third argument in the Python Tutorial, section 5, but you can find them in the more comprehensive Python Standard Library reference, section 4.6:
s.index(x[, i[, j]]) index of the first occurrence of x in s (at or after index i and before index j) (8)
(8) index raises ValueError when x is not found in s. When supported, the additional arguments to the index method allow efficient searching of subsections of the sequence. Passing the extra arguments is roughly equivalent to using s[i:j].index(x), only without copying any data and with the returned index being relative to the start of the sequence rather than the start of the slice.
Performance Comparison
In comparing this list.index method with the islice(gen) method, the most important factor is the distance between the occurrences to be found. Once that distance is on average 13 or more, the list.index has a better performance. For lower distances, the fastest method also depends on the number of occurrences to find. The more occurrences to find, the sooner the islice(gen) method outperforms list.index in terms of average distance: this gain fades out when the number of occurrences becomes really large.
The following graph draws the (approximate) border line, at which both methods perform equally well (the X-axis is logarithmic):
Ultimately, short-circuiting is the way to go if you expect a significant number of cases to lead to early termination. Let's explore the possibilities:
Take the case of the list.index method versus the list.count method (these were the two fastest according to my testing, although ymmv).
For list.index, suppose the list contains n or more occurrences of x, so the method is called n times. Within each list.index call, execution is very fast, allowing much faster iteration than the custom generator. If the occurrences of x are far enough apart, a large speedup comes from the lower-level execution of index. If instances of x are close together (a shorter list / more common x), much more of the time is spent executing the slower Python code that mediates the rest of the function (looping over n and incrementing i).
The benefit of list.count is that it does all of the heavy lifting outside of slow Python execution. It is a much easier function to analyse, as it is simply a case of O(n) time complexity in the list length. By spending almost none of the time in the Python interpreter, it is almost guaranteed to be faster for short lists.
Summary of selection criteria:
shorter lists favor list.count
lists of any length that don't have a high probability to short circuit favor list.count
lists that are long and likely to short circuit favor list.index
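One way to encode those criteria, as a rough sketch (the length cutoff of 1000 is a made-up number you would want to tune on your own data):
def check_list(l, x, n):
    if len(l) < 1000:
        # short list: let the C-level count do all the work
        return l.count(x) >= n
    # long list: short-circuit using index
    i = 0
    try:
        for _ in range(n):
            i = l.index(x, i) + 1
        return True
    except ValueError:
        return False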
I would recommend using Counter from the collections module.
from collections import Counter
import numpy as np

%%time
[k for k, v in Counter(np.random.randint(0, 10000, 10000000)).items() if v > 1100]

# Output:
Wall time: 2.83 s
[1848, 1996, 2461, 4481, 4522, 5844, 7362, 7892, 9671, 9705]
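Applied to the question's check_list, a minimal sketch (note that Counter counts the whole list, so this does not short-circuit):
from collections import Counter

def check_list(l, x, n):
    return Counter(l)[x] >= n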
This shows another way of doing it.
Sort the list.
Find the index of the first occurrence of the item.
Increase the index by one less than the number of times the item must occur. (n - 1)
Find if the element at that index is the same as the item you want to find.
def check_list(l, x, n):
    _l = sorted(l)
    try:
        index_1 = _l.index(x)
        return _l[index_1 + n - 1] == x
    except (ValueError, IndexError):  # ValueError if x is not in the list at all
        return False
def check_list(l, x, n):
    c = 0
    for i in l:
        if i == x:
            c += 1
    if c >= n:
        print("true")
    else:
        print("false")
Another possibility might be:
def check_list(l, x, n):
    return sum(1 for i in l if i == x) >= n
Let's say I have a list:
list=['plu;ean;price;quantity','plu1;ean1;price1;quantity1']
I want to iterate over the list + split the list by ";" and put an if clause, like this:
for item in list:
    split_item = item.split(";")
    if split_item[0] == "string_value" or split_item[1] == "string_value":
        do something.....
I was wondering, if this is the fastest way possible? Let's say my initial list is a lot bigger (has a lot more list items). I tried with list comprehensions:
item=[item.split(";") for item in list if item.split(";")[0] == "string_value" or item.split(";")[1] == "string_value"]
But this is actually giving me slower results. The first case is giving me an average of 90ms, while the second one is giving me an average of 130ms.
Am I doing the list comprehension wrong? Is there a faster solution?
I was wondering, if this is the fastest way possible?
No, of course not. You can implement it a lot faster in hand-coded assembly than in Python. So what?
If the "do something..." is not trivial, and there are many matches, the cost to do something 100000 times is going to be a lot more expensive than the cost of looping 500000 times, so finding the fastest way to loop doesn't matter at all.
In fact, just calling split two to three times each loop instead of remembering and reusing the result is going to swamp the cost of iteration, and not passing a maxsplit argument when you only care about the first two fields may as well.
So, you're trying to optimize the wrong thing. But what if, after you fix everything else, it turns out that the cost of iteration really does matter here?
Well, you can't use a comprehension directly to speed things up, because comprehensions are for expressions that return values, not statements to do things.
But, if you look at your code, you'll realize you're actually doing three things: splitting each string, then filtering out the ones that don't match, then doing the "do something". So, you can use a comprehension for the first two parts, and then you're only using a slow for loop for the much smaller list of values that passed the filter.
It looks like you tried this, but you made two mistakes.
First, you're better off with a generator expression than a list comprehension: you don't need a list here, just something to iterate over, so don't pay to build one.
Second, you don't want to split the string three times. You can probably find some convoluted way to get the split done once in a single comprehension, but why bother? Just write each step as its own step.
So:
split_items = (item.split(';') for item in items)
filtered_items = (item for item in split_items
                  if item[0] == "string_value" or item[1] == "string_value")
for item in filtered_items:
    do something...
Will this actually be faster? If you can get some real test data, and "do something..." code, that shows that the iteration is a bottleneck, you can test on that real data and code. Until then, there's nothing to test.
Split the whole string only when the first two items retrieved from str.split(';', 2) satisfy the conditions:
>>> strs = 'plu;ean;price;quantity'
>>> strs.split(';', 2)
['plu', 'ean', 'price;quantity']
Here, the third item ('price;quantity') is split only if the first two items satisfy the condition:
>>> lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*1000]*1000
Normal for-loop, single split of whole string for each item of the list.
>>> %%timeit
for item in lis:
    split_item = item.split(";")
    if split_item[0] == "plu" or split_item[1] == "ean": pass
...
1 loops, best of 3: 952 ms per loop
List comprehension equivalent to the for-loop above:
>>> %timeit [x for x in (item.split(';') for item in lis) if x[0]== "plu" or x[1]=="ean"]
1 loops, best of 3: 961 ms per loop
Split on-demand:
>>> %timeit [[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== "plu" or y=="ean"]
1 loops, best of 3: 508 ms per loop
Of course, if the list and strings are small then such optimisation doesn't matter.
EDIT: It turns out that the Regex cache was being a bit unfair to the competition. My bad. Regex is only a small percentage faster.
If you're looking for speed, hcwhsa's answer should be good enough. If you need slightly more, look to re.
import re
from itertools import chain
lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*100]*1000
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
[l.split(';') for l in lis if matcher(l)]
Timings, for mostly positive results (aka. split is the major cause of slowness):
SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity' for i in range(10000)]
"
python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
We see mine's a little faster.
10 loops, best of 3: 55 msec per loop
10 loops, best of 3: 49.5 msec per loop
For mostly negative results (most things are filtered):
SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(1000)] + ['plu;ean;price;quantity' for i in range(10000)]
"
python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
The lead's a touch higher.
10 loops, best of 3: 40.9 msec per loop
10 loops, best of 3: 35.7 msec per loop
If the result will always be unique, use
next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')
or the faster Regex version
next(filter(matcher, lis)).split(';')
(use itertools.ifilter on Python 2).
Timings:
SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity'] + ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)]
"
python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
python -m timeit -s "$SETUP" "next(filter(matcher, lis)).split(';')"
Results:
10 loops, best of 3: 31.3 msec per loop
100 loops, best of 3: 15.2 msec per loop
10 loops, best of 3: 28.8 msec per loop
100 loops, best of 3: 14.1 msec per loop
So this gives a substantial boost to both methods.
I found a good alternative here.
You can use a combination of map and filter. Try this:
>>> import itertools
>>> splited_list = itertools.imap(lambda x: x.split(";"), your_list)
>>> result = filter(lambda x: x[0] == "plu" or x[1] == "string_value", splited_list)
The first line creates an iterator of split items, and the second one filters it.
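For example, with the question's data (a sketch; in Python 2, filter returns a plain list here, so the result can be printed directly):
>>> your_list = ['plu;ean;price;quantity', 'plu1;ean1;price1;quantity1']
>>> splited_list = itertools.imap(lambda x: x.split(";"), your_list)
>>> filter(lambda x: x[0] == "plu" or x[1] == "string_value", splited_list)
[['plu', 'ean', 'price', 'quantity']]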
I run a small benchmark in my IPython Notebook shell, and got the following results:
1st test:
With small sizes, the one-line solution works better
2nd test:
With a bigger list, the map/filter solution is slightly better
3rd test:
With a big list and bigger elements, the map/filter solution is way better.
I guess the difference in performance continues to increase as the size of the list grows, until it peaks at around 66% more time (in a 10000-element list trial).
The difference between the map/filter solution and the list comprehension solution is the number of calls to .split(). One calls it three times for each item, the other just once, because list comprehensions are just a Pythonic way to do map/filter together. I used to use list comprehensions a lot and thought I didn't know what lambda was all about, until I discovered that map and list comprehensions are the same thing.
If you don't care about memory usage, you can use regular map instead of imap. It will create the list of splits all at once. It will use more memory to store it, but it's slightly faster.
Actually, if you don't care about memory usage, you can write the map/filter solution using two list comprehensions and get the exact same result. Check it out:
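# a sketch of the two-comprehension equivalent
splited_list = [x.split(";") for x in your_list]
result = [x for x in splited_list if x[0] == "plu" or x[1] == "string_value"]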
not sure this was asked before, but I couldn't find an obvious answer. I'm trying to count the number of elements in a list that are equal to a certain value. The problem is that these elements are not of a built-in type. So if I have
class A:
    def __init__(self, a, b):
        self.a = a
        self.b = b

stuff = []
for i in range(1,10):
    stuff.append(A(i/2, i%2))
Now I would like a count of the list elements whose field b = 1. I came up with two solutions:
print [e.b for e in stuff].count(1)
and
print len([e for e in stuff if e.b == 1])
Which is the best method? Is there a better alternative? It seems that the count() method does not accept keys (at least in Python version 2.5.1).
Many thanks!
sum(x.b == 1 for x in L)
A boolean (as resulting from comparisons such as x.b == 1) is also an int, with a value of 0 for False, 1 for True, so arithmetic such as summation works just fine.
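For example, since True behaves as 1 in arithmetic:
>>> True + True + False
2
>>> sum(x.b == 1 for x in stuff)  # with the question's stuff list
5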
This is the simplest code, but perhaps not the speediest (only timeit can tell you for sure;-). Consider (simplified case to fit well on command lines, but equivalent):
$ py26 -mtimeit -s'L=[1,2,1,3,1]*100' 'len([x for x in L if x==1])'
10000 loops, best of 3: 56.6 usec per loop
$ py26 -mtimeit -s'L=[1,2,1,3,1]*100' 'sum(x==1 for x in L)'
10000 loops, best of 3: 87.7 usec per loop
So, for this case, the "memory wasteful" approach of generating an extra temporary list and checking its length is actually solidly faster than the simpler, shorter, memory-thrifty one I tend to prefer. Other mixes of list values, Python implementations, availability of memory to "invest" in this speedup, etc, can affect the exact performance, of course.
print sum(1 for e in L if e.b == 1)
I would prefer the second one as it's only looping over the list once.
If you use count() you're looping over the list once to get the b values, and then looping over it again to see how many of them equal 1.
A neat way may be to use reduce():
reduce(lambda x, y: x + (1 if y.b == 1 else 0), stuff, 0)
The documentation tells us that reduce() will:
Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
So we define a lambda that adds one to the accumulated value only if the list item's b attribute is 1.
To hide reduce details, you may define a count function:
def count(condition, stuff):
    return reduce(lambda s, x: s + (1 if condition(x) else 0), stuff, 0)
Then you may use it by providing the condition for counting:
n = count(lambda i: i.b == 1, stuff)
Given the input
name = ['ball', 'jeans', 'ball', 'ball', 'ball', 'jeans']
price = [1, 4, 1, 1, 1, 4]
weight = [2, 2, 2, 3, 2, 2]
First create a defaultdict to record the occurrence
from collections import defaultdict
occurrences = defaultdict(int)
Increment the count
for n, p, w in zip(name, price, weight):
    occurrences[(n, p, w)] += 1
Finally count the ones that appear more than once (True will yield 1)
print(sum(cnt > 1 for cnt in occurrences.values()))
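With the sample input above, ('ball', 1, 2) occurs three times and ('jeans', 4, 2) twice, so this prints 2.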