I have a list of let's say 500,000 entries, with each being a tuple such as (val1, val2).
Currently, I am looping through the list and inside the loop, I have a condition such as:
if val2 == someval:
    do_something()
    break
However, I was wondering if there was a faster way to loop through elements on a certain condition, such as only looping through items where val2 == someval, rather than the entire list THEN doing the check.
What about taking it from the other side:
if someval in lst:
    my_action(someval)
The membership test someval in lst also requires a loop, but it runs in more optimized code in C, so it might be faster.
In [49]: x = 3
In [50]: %timeit x in [1, 2, 3]
10000000 loops, best of 3: 53.8 ns per loop
In [51]: %timeit x == 1 or x == 2 or x == 3
10000000 loops, best of 3: 85.5 ns per loop
In [52]: x = 1
In [53]: %timeit x in [1, 2, 3]
10000000 loops, best of 3: 38.5 ns per loop
In [54]: %timeit x == 1 or x == 2 or x == 3
10000000 loops, best of 3: 38.4 ns per loop
Here you can see that for values found "soon" in the test, the time difference is negligible, but for values found later on, the membership test is faster.
More realistic measurements case: having range of 500000 numbers, testing presence of a number in the middle:
In [64]: lst = range(500000)
In [65]: %%timeit
250000 in lst
....:
100 loops, best of 3: 2.66 ms per loop
In [66]: %%timeit
for i in lst:
    if i == 250000:
        break
....:
100 loops, best of 3: 6.6 ms per loop
The time needed drops to about 40% with the membership test x in lst.
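Note that the OP's list holds tuples, so a bare someval in lst compares whole tuples. A sketch of applying the same idea to the second elements (the data here is invented for illustration):

```python
pairs = [(i, 2 * i) for i in range(1000)]  # stand-in for the real data
someval = 998

# Project out the second elements once; the `in` test itself then runs in C.
# (Build a set instead of a list if you will query repeatedly: O(1) lookups.)
second_vals = [v2 for _, v2 in pairs]
found = someval in second_vals
```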
I'm not so sure there is a faster way. As I see it, you have to first find "val2" in the list, which in my experience requires a loop.
This code for example will iterate the loop and will print val1
only if val2 == someval.
for val1, val2 in some_list:
    if val2 != someval:
        continue
    print val1
You seem to be asking two different questions: one where you break the first time you see something equal to someval, and another where you only look through the items that are equal to someval. For the latter, i.e.:
"However, I was wondering if there was a faster way to loop through elements on a certain condition, such as only looping through items where val2 == someval, rather than the entire list THEN doing the check."
You can do:
for i in filter(lambda t: t[1] == someval, val_list):
    stuff
Or via list comprehension:
for i in [x for x in val_list if x[1] == someval]:
    stuff
My guess is that one of these is faster.
There is no way to avoid the loop or the if; the answers suggesting otherwise are mistaken. filter and list comprehensions will not improve matters one bit. In fact, unless you use generator expressions (which are lazily evaluated), comprehensions (as well as filter) will make this potentially much slower and more memory-hungry. And generator expressions will not improve performance either.
There is no way to make it faster other than rewriting in a language such as C or Java, or using PyPy or Cython. for x in ...: if x ...: do_smth() is already the fastest possible way. Of course, depending on your data, you could build the data structure (with its 500,000 items) so that it is always sorted, letting you loop over only the beginning of the list. Or you could collect the items satisfying a certain condition into a separate list/set/whatnot up front, which yields very good results later by completely avoiding the filtering and the full loop iteration.
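A sketch of that last suggestion (names and data invented for illustration): build an index keyed by val2 once, after which each lookup avoids the full scan entirely.

```python
from collections import defaultdict

pairs = [(i, i % 10) for i in range(1000)]  # sample data

# One O(n) pass to build the index...
index = defaultdict(list)
for val1, val2 in pairs:
    index[val2].append((val1, val2))

# ...then each lookup costs only O(1) plus the size of the match set.
matches = index[3]
```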
You have to "see" all the elements in the list to able to decide if at any point val2 == someval and your list isn't sorted on the second value in the tuple, so looping over all the elements can't be avoided.
However, you can make sure the method you use to loop over the list is as efficient as possible. For instance, instead of using a for statement, you may use list comprehensions to filter out the values that satisfy val2 == someval and then do something if the returned list isn't empty. I say "may" because it really depends on the distribution of your data; whether it's useful to you to have all values for which val2 == someval holds true and performing some action etc.
If you're using Python 3.x then "list comprehensions and generator expressions in Python 3 are actually faster than they were in Python 2".
Related
In doing problems on LeetCode etc. it's often required to iterate from the end of the array to the front, and I'm used to more traditional programming languages where the for loop is less awkward, i.e. for (int i = n; i >= 0; i--) where n is the last index of the array. In Python I find myself writing something like for i in range(n, -1, -1), which looks a bit awkward, so I just wanted to know if there was something more elegant. I know that I can reverse the array with array[::-1] and then loop as usual, but that's not really what I want to do, since it adds computational cost to the problem.
Use reversed, which doesn't create a new list but instead creates a reverse iterator, allowing you to iterate in reverse:
a = [1, 2, 3, 4, 5]
for n in reversed(a):
    print(n)
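If you also need the indices (as in the for (int i = n; i >= 0; i--) pattern), one common idiom, sketched here rather than taken from the answer above, is to wrap range itself in reversed:

```python
a = [10, 20, 30]

# Iterate indices from len(a)-1 down to 0 without building a new list.
out = []
for i in reversed(range(len(a))):
    out.append((i, a[i]))
```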
Just a comparison of three methods.
array = list(range(100000))

def go_by_index():
    for i in range(len(array)-1, -1, -1):
        array[i]

def revert_array_directly():
    for n in array[::-1]:
        n

def reversed_fn():
    for n in reversed(array):
        n

%timeit go_by_index()
%timeit revert_array_directly()
%timeit reversed_fn()
Outputs
100 loops, best of 3: 4.84 ms per loop
100 loops, best of 3: 2.01 ms per loop
1000 loops, best of 3: 1.49 ms per loop
The time difference is visible, but as you can see, the second and third options are not that different, especially if the array of interest is of small or medium size.
This is an implementation question for Python 2.7
Say I have a list of integers called nums, and I need to check if all values in nums are equal to zero. nums contains many elements (i.e. more than 10000), with many repeating values.
Using all():
if all(n == 0 for n in set(nums)): # I assume this conversion from list to set helps?
    # do something
Using set subtraction:
if set(nums) - {0} == set([]):
    # do something
Edit: a better way to do the above approach, courtesy of user U9-Forward:
if set(nums) == {0}:
    # do something
How do the time and space complexities compare for each of these approaches? Is there a more efficient way to check this?
Note: for this case, I am trying to avoid using numpy/pandas.
Any set conversion of nums won't help as it will iterate the entire list:
if all(n == 0 for n in nums):
    # ...
is just fine as it stops at the first non-zero element, disregarding the remainder.
Asymptotically, all these approaches are linear with random data.
Implementation details (no repeated function calls on the generator) make not any(nums) even faster, but that relies on the absence of any falsy elements other than 0, e.g. '' or None.
not any(nums) is probably the fastest because it will stop when/if it finds any non-zero element.
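A small demonstration of that caveat (constructed example, not from the answer):

```python
nums = [0, 0, 0]
mixed = [0, '', None]  # all falsy, but not all zeros

# not any() treats every falsy value as "zero-like":
fast_says_zero = not any(mixed)                 # True, although '' and None are not 0
strict_says_zero = all(n == 0 for n in mixed)   # False: '' == 0 and None == 0 are False
```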
Performance comparison:
a = range(10000)
b = [0] * 10000
%timeit not any(a) # 72 ns, fastest for non-zero lists
%timeit not any(b) # 33 ns, fastest for zero lists
%timeit all(n == 0 for n in a) # 365 ns
%timeit all(n == 0 for n in b) # 350 µs
%timeit set(a)=={0} # 228 µs
%timeit set(b)=={0} # 58 µs
If you can use numpy, then (np.array(nums) == 0).all() should do it.
Additionally to @schwobaseggl's answer, the second example can be written even better:
if set(nums)=={0}:
    # do something
I am trying to use map to avoid a loop in Python in order to get better performance. My code is:
def fun(s):
    result = []
    for i in range(len(s)-1):
        if (s[i:i+2]=="ab"):
            result.append(s[:i]+"cd"+s[i+2:])
    return result
My guess for the function is:
def fun(s):
    return map(lambda s : s[:i]+"cd"+s[i+2:] if s[i:i+2]=="ab", s)
However, I do not know how to associate i with s in this case... And the function above is wrong in syntax.
Anyone could help?
Edit: added explanation
A lot of people are confused about why I am doing this. The idea simply comes from the Python performance documentation (see the Loops section) and Guido's article. I am just learning.
Big thanks to @gboffi for the perfect and neat answer!
A Possible Solution
I've written the function using two auxiliary definitions, but if you want you can write it as a one liner,
def fun(s):
    substitute = lambda i: s[:i]+'cd'+s[i+2:]
    match = lambda i: s[i:i+2]=='ab'
    return map(substitute, filter(match, range(len(s)-1)))
It works by creating, with filter, the list of indices for which s[i:i+2] matches 'ab', and mapping the string-substitution function only over the indices that matched.
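As a quick behavioural check (wrapping map in list() so the sketch also runs on Python 3, where map is lazy):

```python
def fun2(s):
    substitute = lambda i: s[:i] + 'cd' + s[i+2:]
    match = lambda i: s[i:i+2] == 'ab'
    return list(map(substitute, filter(match, range(len(s)-1))))

result = fun2('aab')  # single match at i=1
```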
Timings
It is apparent that there is a large overhead due to the construction of the lambdas at each invocation, but fortunately it is easy to test this hypothesis:
In [41]: def fun(s):
   ....:     result = []
   ....:     for i in range(len(s)-1):
   ....:         if (s[i:i+2]=="ab"):
   ....:             result.append(s[:i]+"cd"+s[i+2:])
   ....:     return result
   ....:
In [42]: def fun2(s):
   ....:     substitute = lambda i: s[:i]+'cd'+s[i+2:]
   ....:     match = lambda i: s[i:i+2]=='ab'
   ....:     return map(substitute, filter(match, range(len(s)-1)))
   ....:
In [43]: %timeit fun('aaaaaaabaaaabaaabaaab')
100000 loops, best of 3: 2.38 µs per loop
In [44]: %timeit fun2('aaaaaaabaaaabaaabaaab')
100000 loops, best of 3: 3.74 µs per loop
In [45]: %timeit fun('aaaaaaabaaaabaaabaaab'*1000)
10 loops, best of 3: 33.7 ms per loop
In [46]: %timeit fun2('aaaaaaabaaaabaaabaaab'*1000)
10 loops, best of 3: 33.8 ms per loop
For a short string the map version is 50% slower, while for a very long string the timings are asymptotically equal.
First, I don't think that map has a performance advantage over a for loop.
If 's' is large then you may use xrange instead of range https://docs.python.org/2/library/functions.html#xrange
Second, map cannot filter elements; it can only map them to new values.
You may use a comprehension instead of a for loop, but I don't think you'd get a performance advantage either.
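For completeness, the comprehension version of the original function might look like this (a sketch, equivalent in output to the loop above):

```python
def fun_comp(s):
    # Same semantics as the original loop: one substituted string per "ab" match.
    return [s[:i] + "cd" + s[i+2:] for i in range(len(s)-1) if s[i:i+2] == "ab"]
```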
Suppose you have a list that is n entries long. This list does not contain uniform data (some entries maybe strings, others integers, or even other lists). Assuming that list contains at least one instance of a given value, what is the fastest to remove all instances in that list?
I can think of two, a list comprehension, or .remove()
[item for item in lst if item != itemToExclude]
for i in range(lst.count(itemToExclude)): lst.remove(itemToExclude)
But I have no sense for which of these will be fastest for an arbitrarily large list, or if there are any other ways. As a side note, if someone could provide some guidelines for determining the speed of methods at a glance, I would greatly appreciate it!
Your method 1 will be faster in general because it iterates the list just once, in C code. The second method iterates through the list once for the lst.count call, and then again from the start every time lst.remove gets called!
To measure these things, use timeit.
It is also worth mentioning that the two methods you propose are doing slightly different things:
[item for item in lst if item != itemToExclude]
This creates a new list.
for i in range(lst.count(itemToExclude)): lst.remove(itemToExclude)
This modifies the existing list.
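If in-place behaviour is what you need, a common trick (offered here as a sketch, not from the answer above) is slice assignment, which rebuilds the list's contents in one pass while keeping the list object's identity:

```python
lst = [1, 2, 1, 3, 1]
same = lst

# Slice assignment replaces the contents without rebinding the name,
# so other references to the list see the change too.
lst[:] = [item for item in lst if item != 1]
```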
Your second solution is much less efficient than your first. count and remove both traverse the list, so to remove N copies of an item, you have to traverse the list N+1 times. Whereas the list comprehension only traverses the list once no matter how many copies there are.
Try this one:
filter(lambda x: x != itemToExclude, lst)
There are no Python-level loops here - the loop, going once over the data, is done "at C speed" (well, in CPython, "the usual" implementation).
test.py:
lst = range(100) * 100
itemToExclude = 1

def do_nothing(lst):
    return lst

def listcomp(lst):
    return [item for item in lst if item != itemToExclude]

def listgenerator(lst):
    return list(item for item in lst if item != itemToExclude)

def remove(lst):
    for i in range(lst.count(itemToExclude)):
        lst.remove(itemToExclude)

def filter_lambda(lst):
    return filter(lambda x: x != itemToExclude, lst)

import operator
import functools

def filter_functools(lst):
    return filter(functools.partial(operator.ne, itemToExclude), lst)

lstcopy = list(lst)
remove(lstcopy)
assert(lstcopy == listcomp(list(lst)))
assert(lstcopy == listgenerator(list(lst)))
assert(lstcopy == filter_lambda(list(lst)))
assert(lstcopy == filter_functools(list(lst)))
Results:
$ python -mtimeit "import test; test.do_nothing(list(test.lst))"
10000 loops, best of 3: 26.9 usec per loop
$ python -mtimeit "import test; test.listcomp(list(test.lst))"
1000 loops, best of 3: 686 usec per loop
$ python -mtimeit "import test; test.listgenerator(list(test.lst))"
1000 loops, best of 3: 737 usec per loop
$ python -mtimeit "import test; test.remove(list(test.lst))"
100 loops, best of 3: 8.94 msec per loop
$ python -mtimeit "import test; test.filter_lambda(list(test.lst))"
1000 loops, best of 3: 994 usec per loop
$ python -mtimeit "import test; test.filter_functools(list(test.lst))"
1000 loops, best of 3: 815 usec per loop
So remove loses, but the rest are pretty similar: the list comprehension may have the edge over filter. Obviously you can rerun the same comparison with an input size, number of removed items, and type of item to remove that are more representative of your real intended use.
Let's say I have a list:
list=['plu;ean;price;quantity','plu1;ean1;price1;quantity1']
I want to iterate over the list + split the list by ";" and put an if clause, like this:
for item in list:
    split_item = item.split(";")
    if split_item[0] == "string_value" or split_item[1] == "string_value":
        do something.....
I was wondering, if this is the fastest way possible? Let's say my initial list is a lot bigger (has a lot more list items). I tried with list comprehensions:
item=[item.split(";") for item in list if item.split(";")[0] == "string_value" or item.split(";")[1] == "string_value"]
But this is actually giving me slower results. The first case is giving me an average of 90ms, while the second one is giving me an average of 130ms.
Am I doing the list comprehension wrong? Is there a faster solution?
I was wondering, if this is the fastest way possible?
No, of course not. You can implement it a lot faster in hand-coded assembly than in Python. So what?
If the "do something..." is not trivial, and there are many matches, the cost to do something 100000 times is going to be a lot more expensive than the cost of looping 500000 times, so finding the fastest way to loop doesn't matter at all.
In fact, just calling split two to three times each loop instead of remembering and reusing the result is going to swamp the cost of iteration, and so may failing to pass a maxsplit argument when you only care about the first two results.
So, you're trying to optimize the wrong thing. But what if, after you fix everything else, it turns out that the cost of iteration really does matter here?
Well, you can't use a comprehension directly to speed things up, because comprehensions are for expressions that return values, not statements to do things.
But, if you look at your code, you'll realize you're actually doing three things: splitting each string, then filtering out the ones that don't match, then doing the "do something". So, you can use a comprehension for the first two parts, and then you're only using a slow for loop for the much smaller list of values that passed the filter.
It looks like you tried this, but you made two mistakes.
First, you're better off with a generator expression than a list comprehension: you don't need a list here, just something to iterate over, so don't pay to build one.
Second, you don't want to split the string three times. You can probably find some convoluted way to get the split done once in a single comprehension, but why bother? Just write each step as its own step.
So:
split_items = (item.split(';') for item in items)
filtered_items = (item for item in split_items
                  if item[0] == "string_value" or item[1] == "string_value")
for item in filtered_items:
    do something...
Will this actually be faster? If you can get some real test data, and "do something..." code, that shows that the iteration is a bottleneck, you can test on that real data and code. Until then, there's nothing to test.
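For concreteness, here is the same pipeline run end to end on the sample data from the question (the filter values are illustrative):

```python
items = ['plu;ean;price;quantity', 'plu1;ean1;price1;quantity1']

# Lazily split, then lazily filter; nothing is materialised until iteration.
split_items = (item.split(';') for item in items)
filtered_items = (item for item in split_items
                  if item[0] == 'plu' or item[1] == 'ean')

matched = list(filtered_items)
```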
Split the whole string only when the first two items retrieved from str.split(';', 2) satisfy the conditions:
>>> strs = 'plu;ean;price;quantity'
>>> strs.split(';', 2)
['plu', 'ean', 'price;quantity']
Here the third item ('price;quantity') is split further only if the first two items have satisfied the condition:
>>> lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*1000]*1000
Normal for-loop, single split of whole string for each item of the list.
>>> %%timeit
for item in lis:
    split_item = item.split(";")
    if split_item[0] == "plu" or split_item[1] == "ean": pass
...
1 loops, best of 3: 952 ms per loop
List comprehension equivalent to the for-loop above:
>>> %timeit [x for x in (item.split(';') for item in lis) if x[0]== "plu" or x[1]=="ean"]
1 loops, best of 3: 961 ms per loop
Split on-demand:
>>> %timeit [[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== "plu" or y=="ean"]
1 loops, best of 3: 508 ms per loop
Of course, if the list and strings are small then such optimisation doesn't matter.
EDIT: It turns out that the Regex cache was being a bit unfair to the competition. My bad. Regex is only a small percentage faster.
If you're looking for speed, hcwhsa's answer should be good enough. If you need slightly more, look to re.
import re
from itertools import chain
lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*100]*1000
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
[l.split(';') for l in lis if matcher(l)]
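To see what the pattern accepts, here is a small check (the third sample string is invented for illustration; it matches because its second field is exactly "ean"):

```python
import re

# Matches strings whose first field is exactly "plu" or whose second field
# is exactly "ean".
matcher = re.compile(r'^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu;ean;price;quantity', 'plu1;ean1;price1;quantity1', 'x;ean;1;2']
result = [l.split(';') for l in lis if matcher(l)]
```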
Timings, for mostly positive results (aka. split is the major cause of slowness):
SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity' for i in range(10000)]
"
python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
We see mine's a little faster.
10 loops, best of 3: 55 msec per loop
10 loops, best of 3: 49.5 msec per loop
For mostly negative results (most things are filtered):
SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(1000)] + ['plu;ean;price;quantity' for i in range(10000)]
"
python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
The lead's a touch higher.
10 loops, best of 3: 40.9 msec per loop
10 loops, best of 3: 35.7 msec per loop
If the result will always be unique, use
next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')
or the faster Regex version
next(filter(matcher, lis)).split(';')
(use itertools.ifilter on Python 2).
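On Python 3, the next(filter(...)) form can be sketched like this (sample data invented; filter is lazy, so scanning stops at the first match):

```python
import re

matcher = re.compile(r'^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
lis = ['plu1;ean1;p;q', 'plu;ean;price;quantity', 'plu2;ean2;p;q']

# next() pulls only until the first matching line, then splits just that one.
first = next(filter(matcher, lis)).split(';')
```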
Timings:
SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity'] + ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)]
"
python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
python -m timeit -s "$SETUP" "next(filter(matcher, lis)).split(';')"
Results:
10 loops, best of 3: 31.3 msec per loop
100 loops, best of 3: 15.2 msec per loop
10 loops, best of 3: 28.8 msec per loop
100 loops, best of 3: 14.1 msec per loop
So this gives a substantial boost to both methods.
I found a good alternative here.
You can use a combination of map and filter. Try this:
>>> import itertools
>>> splited_list = itertools.imap(lambda x: x.split(";"), your_list)
>>> result = filter(lambda x: x[0] == "plu" or x[1] == "string_value", splited_list)
The first call creates an iterator of split elements, and the second one filters it.
I run a small benchmark in my IPython Notebook shell, and got the following results:
1st test:
With small sizes, the one-line solution works better
2nd test:
With a bigger list, the map/filter solution is slightly better
3rd test:
With a big list and bigger elements, the map/filter solution is way better.
I guess the difference in performance keeps increasing as the size of the list grows, until it peaks at about 66% more time (in a 10,000-element trial).
The difference between the map/filter solution and the list-comprehension solutions is the number of calls to .split(): one calls it three times for each item, the other just once, because list comprehensions are just a Pythonic way to do map/filter together. I used to use list comprehensions a lot and thought I didn't know what lambda was all about, until I discovered that map and list comprehensions are the same thing.
If you don't care about memory usage, you can use regular map instead of imap. It will create the list of splits all at once, using more memory to store it, but it's slightly faster.
Actually, if you don't care about memory usage, you can write the map/filter solution using two list comprehensions and get the exact same result. Check out: