Speeding up `any` with list comprehension - python

I am using any with a list comprehension. I would like to break the list comprehension when any returns True. For example,
import time

def f(x):
    time.sleep(2)
    return x

beginTime = time.time()
result = any([f(x) == 0 for x in [0,1,3,5,7]])
endTime = time.time()
print(endTime - beginTime)
The above code takes 10 seconds, although it could stop the iteration after the first True.

Use a generator expression instead of a list comprehension to avoid forming the list first:
result = any(f(x) == 0 for x in [0,1,3,5,7])
(The square brackets of the list comprehension are gone.)
Note that any short-circuits in either case; what differs is that the generator expression never builds the whole list first.
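For example, reusing the timing harness from the question (my addition): since f(0) == 0 is True, any stops after the first call, and the whole thing takes roughly 2 seconds instead of 10.
import time

def f(x):
    time.sleep(2)
    return x

beginTime = time.time()
result = any(f(x) == 0 for x in [0, 1, 3, 5, 7])  # short-circuits after f(0)
print(result, time.time() - beginTime)            # True, roughly 2 seconds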

You can use a generator, as Mustafa suggested, but retrieve only the first matching element. The generator does not need to be consumed entirely; the walrus operator does the rest:
import time

def f(x):
    time.sleep(2)
    return x

beginTime = time.time()
result = next((wr for x in [0,1,3,5,7] if (wr := f(x)) == 0))
endTime = time.time()
print(endTime - beginTime)
This takes only the minimum time needed to retrieve the first occurrence.
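One caveat worth adding (my note, not part of the original answer): next() raises StopIteration if nothing matches, so you may want to supply a default.
result = next((wr for x in [0, 1, 3, 5, 7] if (wr := f(x)) == 0), None)
print(result)  # 0 here; None if no element maps to 0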

Related

In Python, why do these 3 ways of retrieving max element in a list have different speeds?

I was benchmarking the time it takes three seemingly similar algorithms to finish running in Python. They are all trying to find the largest element in an unsorted list. Here's the code:
from time import time
from random import random

LIST_LEN = 1000
my_list = [int(random() * LIST_LEN) for i in range(LIST_LEN)]

# Iterate with an index variable
start_time = time()
my_max = 0
i = 0
while i < len(my_list):
    if my_list[i] > my_max:
        my_max = my_list[i]
    i += 1
my_time = time() - start_time
print(' Indexed:\t', my_time)

# Iterate using iterator pattern
start_time = time()
my_max = 0
for num in my_list:
    if num > my_max:
        my_max = num
my_time = time() - start_time
print('Iterator:\t', my_time)

# Use built-in max() function
start_time = time()
my_max = max(my_list)
max_time = time() - start_time
print('Built-in:\t', max_time)
And here are the results:
Indexed: 0.0002579689025878906
Iterator: 6.604194641113281e-05
Built-in: 1.7404556274414062e-05
I get similar results every time I run the program, whether I run it locally or on a server. What's up with the notable difference between the three methods? Accessing by index seems to be about 4 times slower than the iterator pattern, which in turn seems to be about 4 times slower than Python's built-in max() function.
Your first method uses a while loop, which has to evaluate the comparison i < len(my_list), call len(my_list) on every pass, and increment i.
The second method performs fewer Python-level operations per iteration and is therefore faster.
The third method uses a built-in function implemented in C.
All three are O(N) in time complexity.
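As a rough sketch of how to see this yourself (my addition; exact numbers vary by machine), hoisting len() out of the while loop already narrows the gap with the for loop, and max() still wins:
from timeit import timeit
from random import random

my_list = [int(random() * 1000) for _ in range(1000)]

def indexed_hoisted():
    my_max = 0
    i = 0
    n = len(my_list)  # len() evaluated once instead of on every pass
    while i < n:
        if my_list[i] > my_max:
            my_max = my_list[i]
        i += 1
    return my_max

def iterator_style():
    my_max = 0
    for num in my_list:
        if num > my_max:
            my_max = num
    return my_max

print(timeit(indexed_hoisted, number=1000))
print(timeit(iterator_style, number=1000))
print(timeit(lambda: max(my_list), number=1000))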

Speed comparison for iterating over List and Generator in Python

When comparing Python generators vs. lists for performance/optimisation, I read that generators are faster to create than lists, but that iterating over a list is faster than iterating over a generator. However, when I coded an example to test this with small and large samples of data, the two results contradict each other.
When I test iteration speed over a generator and a list built from 1_000_000_000 (so the generator actually yields 500,000,000 numbers), generator iteration comes out faster than list iteration:
from time import time

my_generator = (i for i in range(1_000_000_000) if i % 2 == 0)
start = time()
for i in my_generator:
    pass
print("Time for Generator iteration - ", time() - start)

my_list = [i for i in range(1_000_000_000) if i % 2 == 0]
start = time()
for i in my_list:
    pass
print("Time for List iteration - ", time() - start)
And the output is:
Time for Generator iteration - 67.49345350265503
Time for List iteration - 89.21837282180786
But if I use a smaller input, 10_000_000 instead of 1_000_000_000, list iteration is faster than generator iteration:
from time import time

my_generator = (i for i in range(10_000_000) if i % 2 == 0)
start = time()
for i in my_generator:
    pass
print("Time for Generator iteration - ", time() - start)

my_list = [i for i in range(10_000_000) if i % 2 == 0]
start = time()
for i in my_list:
    pass
print("Time for list iteration - ", time() - start)
The output is:
Time for Generator iteration - 1.0233261585235596
Time for list iteration - 0.11701655387878418
Why is this behaviour happening?
After working through the points made by @gimix and @Dani Mesejo, I found the answer. List iteration is indeed faster than generator iteration.
In the case of the generator, each iteration resumes the generator (much like a function call) and also evaluates the modulus operation, which slows down every single step. In the case of the list, all of that work is done during creation, so the iteration itself is fast.
So creating a list may be slower than creating a generator, but iterating over the list is definitely faster than iterating over the generator.
Also, the code above uses the time module, which is not reliable for benchmarking!
I then used timeit for 1_000_000 and for 1_000_000_000 items, and in both cases list iteration was faster:
import timeit

mysetup = '''my_generator = (i for i in range(10_000_000) if i % 2 == 0)'''

mycode = '''
for i in my_generator:
    pass
'''

mysetup1 = '''my_list = [i for i in range(10_000_000) if i % 2 == 0]'''

mycode1 = '''
for i in my_list:
    pass
'''

print(timeit.timeit(setup=mysetup, stmt=mycode, number=1))
print(timeit.timeit(setup=mysetup1, stmt=mycode1, number=1))
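To isolate the per-item overhead described above (my sketch, not from the original answer), you can drop the modulus filter entirely and time a bare generator against a pre-built list of the same numbers:
import timeit

print(timeit.timeit("for i in gen: pass",
                    setup="gen = (i for i in range(10_000_000))",
                    number=1))
print(timeit.timeit("for i in lst: pass",
                    setup="lst = list(range(10_000_000))",
                    number=1))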
To better understand the efficiency benefit of generators, suppose you want to read a file with 10M rows. First, read it with a regular method like the one below:
from time import time

first_ts = time()

def regular_file_reader(filename):
    file_ = open(filename, "r")
    data = file_.readlines()
    file_.close()
    return data

for row in regular_file_reader("sample_file.csv"):
    print(row)
    second_time = time()
    break

print(second_time - first_ts)
As you can see, after reading the first line of the file we break out of the loop. That is where generators make a difference: only the first element actually has to be produced. For iterating over all of the remaining elements, the generator may even be less efficient.
def generator_file_reader(filename):
    with open(filename, "r") as f:
        for row in f:
            yield row

for row in generator_file_reader("sample_file.csv"):
    print(row)
    second_time = time()
    break

print(second_time - first_ts)
In this case the generator reads only the first line, not the whole file, so using a generator is far faster.

Why does passing a list as a parameter perform better than passing a generator?

I was making an answer for this question, and when I tested the timing for my solution I came up with a contradiction to what I thought was correct.
The asker wanted a way to know how many different lists were contained within another list. (For more information, you can check the question.)
My answer was basically this function:
def how_many_different_lists(lists):
    s = set(str(list_) for list_ in lists)
    return len(s)
Now, the situation came when I measured the time it takes to run and I compared it against basically the same function, but passing a list instead of a generator as a parameter to set():
def the_other_function(lists):
    s = set([str(list_) for list_ in lists])
    return len(s)
This is the decorator I use for testing functions:
import time

def timer(func):
    def func_decorated(*args):
        # time.clock() was removed in Python 3.8; time.perf_counter() is the modern replacement
        start_time = time.clock()
        result = func(*args)
        print(time.clock() - start_time, "seconds")
        return result
    return func_decorated
And these were the results for the given input:
>>> list1 = [[1,2,3],[1,2,3],[1,2,2],[1,2,2]]
>>> how_many_different_lists(list1)
6.916326725558974e-05 seconds
2
>>> the_other_function(list1)
3.882067261429256e-05 seconds
2
Even for larger lists:
# (52 elements)
>>> list2= [[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2],[1,2,3],[1,2,3],[1,2,2],[1,2,2]]
>>> how_many_different_lists(list2)
0.00023560132331112982 seconds
2
>>> the_other_function(list2)
0.00021329059177332965 seconds
2
Now, my question is: why is the second example faster than the first one? Aren't generators supposed to be faster, given that they produce the elements "on demand"? I used to think that building a list and iterating through it was slower.
PS: I have tested this many, many times, getting basically the same results.
I have been benchmarking your functions:
from simple_benchmark import BenchmarkBuilder
from random import choice

b = BenchmarkBuilder()

@b.add_function()
def how_many_different_lists(lists):
    s = set(str(list_) for list_ in lists)
    return len(s)

@b.add_function()
def the_other_function(lists):
    s = set([str(list_) for list_ in lists])
    return len(s)

@b.add_arguments('Number of lists in the list')
def argument_provider():
    for exp in range(2, 18):
        size = 2**exp
        yield size, [list(range(choice(range(100)))) for _ in range(size)]

r = b.run()
r.plot()
Generators are lazy: a generator expression creates the items on the fly, whereas a list comprehension builds the entire list in memory. You can read more here: Generator Expressions vs. List Comprehension.
As you can see from the benchmark, there is not such a big difference between them.
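A smaller, self-contained way to see the same effect (my sketch; exact numbers will differ from the plot) is to time both variants directly with timeit:
from timeit import timeit

def with_genexp(lists):
    return len(set(str(l) for l in lists))

def with_listcomp(lists):
    return len(set([str(l) for l in lists]))

lists = [[1, 2, 3], [1, 2, 3], [1, 2, 2], [1, 2, 2]] * 10_000

print(timeit(lambda: with_genexp(lists), number=10))
print(timeit(lambda: with_listcomp(lists), number=10))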

Comparing Python integers

I have a list of integers below a thousand and a hash function that transforms it into a single, but much larger, integer. The hash function code is below:
def hash_function(lst):
    hsh = 0
    for i, item in enumerate(lst):
        hsh += item * pow(10, i * 3)
    return hsh
Assume that lst has about 4-5 items.
Is comparing two integers more efficient than comparing two lists of much smaller integers? Why or why not? I have to compare a few hundred thousand hashes.
I came up with a quick test to show the difference between the built-in list comparison and your hash function.
import time
import random
import sys

def compareRegular(a, b):
    return a == b

def listHash(lst):
    hsh = 0
    for i, item in enumerate(lst):
        hsh += item * pow(10, i * 3)
    return hsh

def compareHash(a, b):
    return listHash(a) == listHash(b)

def compareLists(hugeList, comparison):
    output = []
    for i, lstA in enumerate(hugeList[:-1]):
        for j, lstB in enumerate(hugeList[i + 1:]):
            if comparison(lstA, lstB):
                output.append([i, j])
    return output

def genList(minValue, maxValue, numElements):
    output = []
    for _ in range(1000):
        smallList = []
        for _ in range(numElements):
            smallList.append(random.randint(minValue, maxValue))
        output.append(smallList)
    return output

random.seed(123)
hugeListA = genList(-sys.maxint - 1, sys.maxint, 5)
hugeListB = genList(0, 100, 5)

print "Test with huge numbers in our list"

start = time.time()
regularOut = compareLists(hugeListA, compareRegular)
end = time.time()
print "Regular compare takes:", end - start

start = time.time()
hashOut = compareLists(hugeListA, compareHash)
end = time.time()
print "Hash compare takes:", end - start

print "Are both outputs the same?", regularOut == hashOut
print

print "Test with smaller number in our lists"

start = time.time()
regularOut = compareLists(hugeListB, compareRegular)
end = time.time()
print "Regular compare takes:", end - start

start = time.time()
hashOut = compareLists(hugeListB, compareHash)
end = time.time()
print "Hash compare takes:", end - start

print "Are both outputs the same?", regularOut == hashOut
On my computer this outputs:
Test with huge numbers in our list
Regular compare takes: 0.0940001010895
Hash compare takes: 3.38999986649
Are both outputs the same? True
Test with smaller number in our lists
Regular compare takes: 0.0789999961853
Hash compare takes: 3.01400017738
Are both outputs the same? True
The people who develop Python definitely spend a lot of time thinking about things like this. I personally have no idea how the built-in list comparison actually works, but I'm pretty certain it doesn't execute within the Python interpreter the way your hash function does. Many Python built-in functions and types are backed by natively executing C code, and the list comparison function likely falls into this category.
Even if you implemented your hash function in a similar way and had it execute natively, it would still likely be slower. You're basically looking at N calls to pow or N number comparisons, and even for variable-size integers, a memcmp-style comparison certainly won't take longer than loading the same values from memory and doing arithmetic on them.
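If you do want hashing in the picture, a safer pattern (my sketch, not the answerer's code) is to hash each list exactly once by converting it to a tuple, which is hashed and compared in C, and then group equal tuples:
from collections import defaultdict

def find_duplicate_groups(huge_list):
    groups = defaultdict(list)
    for idx, lst in enumerate(huge_list):
        groups[tuple(lst)].append(idx)  # one hash per list, computed in C
    return [idxs for idxs in groups.values() if len(idxs) > 1]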

Find the smallest positive number not in list

I have a list in python like this:
myList = [1,14,2,5,3,7,8,12]
How can I easily find the first unused value? (in this case '4')
I came up with several different ways:
Iterate the first number not in set
I wasn't after the shortest code (which might be the set-difference trickery) but something with a good running time.
This might be one of the best approaches proposed here; my tests show that it can be substantially faster than the set-difference approach, especially if the hole is near the beginning:
from itertools import count, filterfalse # ifilterfalse on py2
A = [1,14,2,5,3,7,8,12]
print(next(filterfalse(set(A).__contains__, count(1))))
The list is turned into a set, whose __contains__(x) method corresponds to x in A. count(1) creates a counter that counts up from 1 indefinitely. filterfalse then consumes numbers from the counter until it finds one that is not in the set; that first number is returned by next().
Timing for len(a) = 100000, randomized and the sought-after number is 8:
>>> timeit(lambda: next(filterfalse(set(a).__contains__, count(1))), number=100)
0.9200698399945395
>>> timeit(lambda: min(set(range(1, len(a) + 2)) - set(a)), number=100)
3.1420603669976117
Timing for len(a) = 100000, ordered and the first free is 100001
>>> timeit(lambda: next(filterfalse(set(a).__contains__, count(1))), number=100)
1.520096342996112
>>> timeit(lambda: min(set(range(1, len(a) + 2)) - set(a)), number=100)
1.987783643999137
(note that this is Python 3 and range is the py2 xrange)
Use heapq
The asymptotically good answer: heapq with enumerate
from heapq import heapify, heappop
from functools import partial

# A = [1,2,3] also works
A = [1,14,2,5,3,7,8,12]

end = 2 ** 61       # these are different and neither of them can be the
sentinel = 2 ** 62  # first gap (unless you have 2^64 bytes of memory).

heap = list(A)
heap.append(end)
heapify(heap)

print(next(n for n, v in enumerate(
    iter(partial(heappop, heap), sentinel), 1) if n != v))
Now, the one above could be the preferred solution if written in C, but heapq is written in Python and most probably slower than many other alternatives that mainly use C code.
Just sort and enumerate to find the first not matching
Or the simple answer with good constants for O(n lg n)
next(i for i, e in enumerate(sorted(A) + [ None ], 1) if i != e)
This might be the fastest of all if the list is almost sorted, because of how Python's Timsort works, but for randomized input the set-difference approach and iterating the first number not in the set are faster.
The + [ None ] is necessary for the edge cases of there being no gaps (e.g. [1,2,3]).
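For instance, a quick check of that edge case (my addition):
A = [1, 2, 3]
print(next(i for i, e in enumerate(sorted(A) + [None], 1) if i != e))  # 4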
This makes use of the property of sets
>>> l = [1,2,3,5,7,8,12,14]
>>> m = range(1,len(l))
>>> min(set(m)-set(l))
4
I would suggest you use a generator together with enumerate to determine the missing element:
>>> next(a for a, b in enumerate(myList, myList[0]) if a != b)
4
enumerate pairs each index with an element, so your goal is to find the element that differs from its index.
Note that I am also assuming the elements may not start at a fixed value (in this case 1); if they do, you can simplify the expression further:
>>> next(a for a, b in enumerate(myList, 1) if a != b)
4
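A note of my own: as a later answer points out, enumerate-based solutions rely on the list being sorted, so for the unsorted example you would sort first:
myList = [1, 14, 2, 5, 3, 7, 8, 12]
print(next(a for a, b in enumerate(sorted(myList), 1) if a != b))  # 4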
A for loop with the list will do it.
l = [1,14,2,5,3,7,8,12]
for i in range(1, max(l)):
    if i not in l:
        break
print(i)  # result 4
Don't know how efficient, but why not use an xrange as a mask and use set minus?
>>> myList = [1,14,2,5,3,7,8,12]
>>> min(set(xrange(1, len(myList) + 1)) - set(myList))
4
You're only creating a set as big as myList, so it can't be that bad :)
This won't work for "full" lists:
>>> myList = range(1, 5)
>>> min(set(xrange(1, len(myList) + 1)) - set(myList))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: min() arg is an empty sequence
But the fix to return the next value is simple (add one more to the masked set):
>>> min(set(xrange(1, len(myList) + 2)) - set(myList))
5
import itertools as it

next(i for i in it.count(1) if i not in mylist)
I like this because it reads very closely to what you're trying to do: "start counting, keep going until you reach a number that isn't in the list, then tell me that number". However, this is quadratic since testing i not in mylist is linear.
Solutions using enumerate are linear, but rely on the list being sorted and no value being repeated. Sorting first makes it O(n log n) overall, which is still better than quadratic. However, if you can assume the values are distinct, then you could put them into a set first:
myset = set(mylist)
next(i for i in it.count(1) if i not in myset)
Since set containment checks are roughly constant time, this will be linear overall.
I just solved this in a probably non pythonic way
def solution(A):
    # Const-ish to improve readability
    MIN = 1
    if not A: return MIN
    # Save re-computing MAX
    MAX = max(A)
    # Loop over all entries with minimum of 1 starting at 1
    for num in range(1, MAX):
        # going for greatest missing number return optimistically (minimum)
        # If order needs to switch, then use max as start and count backwards
        if num not in A: return num
    # In case the max is < 0 double wrap max with minimum return value
    return max(MIN, MAX+1)
I think it reads quite well
My effort, no itertools. It sets "current" to one less than the value you are expecting.
list = [1,2,3,4,5,7,8]
current = list[0]-1
for i in list:
    if i != current+1:
        print current+1
        break
    current = i
The naive way is to traverse the list, which is an O(n) solution. However, since the list is sorted, you can use this feature to perform a modified binary search. Basically, you are looking for the last occurrence of A[i] == i.
The pseudo algorithm will be something like:
binarysearch(A):
    start = 0
    end = len(A) - 1
    while (start <= end):
        mid = (start + end) / 2
        if (A[mid] == mid):
            result = A[mid]
            start = mid + 1
        else:  # A[mid] > mid since there is no way A[mid] is less than mid
            end = mid - 1
    return (result + 1)
This is an O(log n) solution. I assumed lists are one indexed. You can modify the indices accordingly
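A runnable Python version of that pseudocode might look like this (my sketch; it assumes a sorted list of distinct positive integers starting at 1 and uses 0-based indexing):
def first_missing_sorted(A):
    # Binary search for the last position where A[i] == i + 1 (0-based).
    start, end = 0, len(A) - 1
    result = 0
    while start <= end:
        mid = (start + end) // 2
        if A[mid] == mid + 1:
            result = A[mid]
            start = mid + 1
        else:  # A[mid] > mid + 1, so the gap is to the left
            end = mid - 1
    return result + 1

print(first_missing_sorted([1, 2, 3, 5, 7, 8, 12, 14]))  # 4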
EDIT: if the list is not sorted, you can use Python's heapq library: store the list in a min-heap and then pop the elements one by one.
pseudo code
H = heapify(A)  # Assuming A is the list
count = 1
for i in range(len(A)):
    if H.pop() != count: return count
    count += 1
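The heap-based pseudocode could be written as runnable Python roughly like this (my sketch, assuming positive integers with no duplicates):
import heapq

def first_missing_heap(A):
    heap = list(A)
    heapq.heapify(heap)
    count = 1
    while heap:
        if heapq.heappop(heap) != count:  # pops values in ascending order
            return count
        count += 1
    return count

print(first_missing_heap([1, 14, 2, 5, 3, 7, 8, 12]))  # 4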
sort + reduce to the rescue!
from functools import reduce # python3
myList = [1,14,2,5,3,7,8,12]
res = 1 + reduce(lambda x, y: x if y-x>1 else y, sorted(myList), 0)
print(res)
Unfortunately it won't stop once a match is found and will iterate over the whole list.
Faster (but less fun) is to use for loop:
myList = [1,14,2,5,3,7,8,12]
res = 0
for num in sorted(myList):
    if num - res > 1:
        break
    res = num
res = res + 1
print(res)
You can try this:
for i in range(1, max(arr1)+2):
    if i not in arr1:
        print(i)
        break
Easy to read, easy to understand, gets the job done:
def solution(A):
    smallest = 1
    unique = set(A)
    for int in unique:
        if int == smallest:
            smallest += 1
    return smallest
Keep incrementing a counter in a loop until you find the first positive integer that's not in the list.
def getSmallestIntNotInList(number_list):
    """Returns the smallest positive integer that is not in a given list"""
    i = 0
    while True:
        i += 1
        if i not in number_list:
            return i

print(getSmallestIntNotInList([1,14,2,5,3,7,8,12]))
# 4
I found that this had the fastest performance compared to the other answers in this post. I tested using timeit in Python 3.10.8. My performance results can be seen below:
import timeit

def findSmallestIntNotInList(number_list):
    # Infinite while-loop until first number is found
    i = 0
    while True:
        i += 1
        if i not in number_list:
            return i

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.038100800011307 seconds

import timeit

def findSmallestIntNotInList(number_list):
    # Loop with a range to len(number_list)+1
    for i in range(1, len(number_list)+1):
        if i not in number_list:
            return i

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.05068870005197823 seconds

import timeit

def findSmallestIntNotInList(number_list):
    # Loop with a range to max(number_list) (by silgon)
    # https://stackoverflow.com/a/49649558/3357935
    for i in range(1, max(number_list)):
        if i not in number_list:
            return i

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.06317249999847263 seconds

import timeit
from itertools import count, filterfalse

def findSmallestIntNotInList(number_list):
    # Iterate the first number not in set (by Antti Haapala -- Слава Україні)
    # https://stackoverflow.com/a/28178803/3357935
    return next(filterfalse(set(number_list).__contains__, count(1)))

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.06515420007053763 seconds

import timeit

def findSmallestIntNotInList(number_list):
    # Use property of sets (by Bhargav Rao)
    # https://stackoverflow.com/a/28176962/3357935
    m = range(1, len(number_list))
    return min(set(m) - set(number_list))

t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.08586219989228994 seconds
The easiest way would be to loop through the sorted list and check whether the index equals the value; if not, return the index as the solution.
This would have complexity O(n log n) because of the sorting:
for index, value in enumerate(sorted(myList), 1):
    if index != value:
        print(index)
        break
Another option is to use Python sets, which are somewhat like dictionaries without values, just keys. In a dictionary you can look up a key in constant time, which makes the whole solution linear, O(n):
mySet = set(myList)
for i in range(1, len(mySet)+1):
    if i not in mySet:
        print(i)
        break
Edit:
If the solution should also deal with lists where no number is missing (e.g. [0,1]) and output the next following number and should also correctly consider 0, then a complete solution would be:
def find_smallest_positive_number_not_in_list(myList):
    mySet = set(myList)
    for i in range(1, max(mySet)+2):
        if i not in mySet:
            return i
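For example (my addition):
print(find_smallest_positive_number_not_in_list([1, 14, 2, 5, 3, 7, 8, 12]))  # 4
print(find_smallest_positive_number_not_in_list([0, 1]))                      # 2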
A solution that returns all those values is
free_values = set(range(1, max(L))) - set(L)
it does a full scan, but those loops are implemented in C and unless the list or its maximum value are huge this will be a win over more sophisticated algorithms performing the looping in Python.
Note that if this search is needed to implement "reuse" of IDs, then keeping a free list around and maintaining it up to date (i.e. adding numbers to it when deleting entries and picking from it when reusing entries) is often a good idea, as sketched below.
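A minimal sketch of that free-list idea (hypothetical names, my addition): keep freed IDs in a min-heap and reuse them before handing out new ones.
import heapq

class IdAllocator:
    def __init__(self):
        self._next = 1   # smallest ID never handed out yet
        self._free = []  # min-heap of returned IDs

    def acquire(self):
        if self._free:
            return heapq.heappop(self._free)  # reuse the smallest freed ID
        nid = self._next
        self._next += 1
        return nid

    def release(self, id_):
        heapq.heappush(self._free, id_)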
The following solution loops over all numbers between 1 and the length of the input list and breaks as soon as a number is not found in it. Otherwise the result is the length of the list plus one.
listOfNumbers = [1,14,2,5,3,7,8,12]

for i in range(1, len(listOfNumbers)+1):
    if i not in listOfNumbers:
        nextNumber = i
        break
else:
    nextNumber = len(listOfNumbers)+1
