I have a list in python like this:
myList = [1,14,2,5,3,7,8,12]
How can I easily find the first unused value? (in this case '4')
I came up with several different ways:
Iterate the first number not in set
I didn't want to get the shortest code (which might be the set-difference trickery) but something that could have a good running time.
This might be one of the best proposed here, my tests show that it might be substantially faster - especially if the hole is in the beginning - than the set-difference approach:
from itertools import count, filterfalse # ifilterfalse on py2
A = [1,14,2,5,3,7,8,12]
print(next(filterfalse(set(A).__contains__, count(1))))
The array is turned into a set, whose __contains__(x) method corresponds to x in A. count(1) creates a counter that starts counting from 1 to infinity. Now, filterfalse consumes the numbers from the counter, until a number is found that is not in the set; when the first number is found that is not in the set it is yielded by next()
Timing for len(a) = 100000, randomized and the sought-after number is 8:
>>> timeit(lambda: next(filterfalse(set(a).__contains__, count(1))), number=100)
0.9200698399945395
>>> timeit(lambda: min(set(range(1, len(a) + 2)) - set(a)), number=100)
3.1420603669976117
Timing for len(a) = 100000, ordered and the first free is 100001
>>> timeit(lambda: next(filterfalse(set(a).__contains__, count(1))), number=100)
1.520096342996112
>>> timeit(lambda: min(set(range(1, len(a) + 2)) - set(a)), number=100)
1.987783643999137
(note that this is Python 3 and range is the py2 xrange)
Use heapq
The asymptotically good answer: heapq with enumerate
from heapq import heapify, heappop
heap = list(A)
heapify(heap)
from heapq import heapify, heappop
from functools import partial
# A = [1,2,3] also works
A = [1,14,2,5,3,7,8,12]
end = 2 ** 61 # these are different and neither of them can be the
sentinel = 2 ** 62 # first gap (unless you have 2^64 bytes of memory).
heap = list(A)
heap.append(end)
heapify(heap)
print(next(n for n, v in enumerate(
iter(partial(heappop, heap), sentinel), 1) if n != v))
Now, the one above could be the preferred solution if written in C, but heapq is written in Python and most probably slower than many other alternatives that mainly use C code.
Just sort and enumerate to find the first not matching
Or the simple answer with good constants for O(n lg n)
next(i for i, e in enumerate(sorted(A) + [ None ], 1) if i != e)
This might be fastest of all if the list is almost sorted because of how the Python Timsort works, but for randomized the set-difference and iterating the first not in set are faster.
The + [ None ] is necessary for the edge cases of there being no gaps (e.g. [1,2,3]).
This makes use of the property of sets
>>> l = [1,2,3,5,7,8,12,14]
>>> m = range(1,len(l))
>>> min(set(m)-set(l))
4
I would suggest you to use a generator and use enumerate to determine the missing element
>>> next(a for a, b in enumerate(myList, myList[0]) if a != b)
4
enumerate maps the index with the element so your goal is to determine that element which differs from its index.
Note, I am also assuming that the elements may not start with a definite value, in this case which is 1, and if it is so, you can simplify the expression further as
>>> next(a for a, b in enumerate(myList, 1) if a != b)
4
A for loop with the list will do it.
l = [1,14,2,5,3,7,8,12]
for i in range(1, max(l)):
if i not in l: break
print(i) # result 4
Don't know how efficient, but why not use an xrange as a mask and use set minus?
>>> myList = [1,14,2,5,3,7,8,12]
>>> min(set(xrange(1, len(myList) + 1)) - set(myList))
4
You're only creating a set as big as myList, so it can't be that bad :)
This won't work for "full" lists:
>>> myList = range(1, 5)
>>> min(set(xrange(1, len(myList) + 1)) - set(myList))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: min() arg is an empty sequence
But the fix to return the next value is simple (add one more to the masked set):
>>> min(set(xrange(1, len(myList) + 2)) - set(myList))
5
import itertools as it
next(i for i in it.count() if i not in mylist)
I like this because it reads very closely to what you're trying to do: "start counting, keep going until you reach a number that isn't in the list, then tell me that number". However, this is quadratic since testing i not in mylist is linear.
Solutions using enumerate are linear, but rely on the list being sorted and no value being repeated. Sorting first makes it O(n log n) overall, which is still better than quadratic. However, if you can assume the values are distinct, then you could put them into a set first:
myset = set(mylist)
next(i for i in it.count() if i not in myset)
Since set containment checks are roughly constant time, this will be linear overall.
I just solved this in a probably non pythonic way
def solution(A):
# Const-ish to improve readability
MIN = 1
if not A: return MIN
# Save re-computing MAX
MAX = max(A)
# Loop over all entries with minimum of 1 starting at 1
for num in range(1, MAX):
# going for greatest missing number return optimistically (minimum)
# If order needs to switch, then use max as start and count backwards
if num not in A: return num
# In case the max is < 0 double wrap max with minimum return value
return max(MIN, MAX+1)
I think it reads quite well
My effort, no itertools. Sets "current" to be the one less than the value you are expecting.
list = [1,2,3,4,5,7,8]
current = list[0]-1
for i in list:
if i != current+1:
print current+1
break
current = i
The naive way is to traverse the list which is an O(n) solution. However, since the list is sorted, you can use this feature to perform binary search (a modified version for it). Basically, you are looking for the last occurance of A[i] = i.
The pseudo algorithm will be something like:
binarysearch(A):
start = 0
end = len(A) - 1
while(start <= end ):
mid = (start + end) / 2
if(A[mid] == mid):
result = A[mid]
start = mid + 1
else: #A[mid] > mid since there is no way A[mid] is less than mid
end = mid - 1
return (result + 1)
This is an O(log n) solution. I assumed lists are one indexed. You can modify the indices accordingly
EDIT: if the list is not sorted, you can use the heapq python library and store the list in a min-heap and then pop the elements one by one
pseudo code
H = heapify(A) //Assuming A is the list
count = 1
for i in range(len(A)):
if(H.pop() != count): return count
count += 1
sort + reduce to the rescue!
from functools import reduce # python3
myList = [1,14,2,5,3,7,8,12]
res = 1 + reduce(lambda x, y: x if y-x>1 else y, sorted(myList), 0)
print(res)
Unfortunatelly it won't stop after match is found and will iterate whole list.
Faster (but less fun) is to use for loop:
myList = [1,14,2,5,3,7,8,12]
res = 0
for num in sorted(myList):
if num - res > 1:
break
res = num
res = res + 1
print(res)
you can try this
for i in range(1,max(arr1)+2):
if i not in arr1:
print(i)
break
Easy to read, easy to understand, gets the job done:
def solution(A):
smallest = 1
unique = set(A)
for int in unique:
if int == smallest:
smallest += 1
return smallest
Keep incrementing a counter in a loop until you find the first positive integer that's not in the list.
def getSmallestIntNotInList(number_list):
"""Returns the smallest positive integer that is not in a given list"""
i = 0
while True:
i += 1
if i not in number_list:
return i
print(getSmallestIntNotInList([1,14,2,5,3,7,8,12]))
# 4
I found that this had the fastest performance compared to other answers on this post. I tested using timeit in Python 3.10.8. My performance results can be seen below:
import timeit
def findSmallestIntNotInList(number_list):
# Infinite while-loop until first number is found
i = 0
while True:
i += 1
if i not in number_list:
return i
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.038100800011307 seconds
import timeit
def findSmallestIntNotInList(number_list):
# Loop with a range to len(number_list)+1
for i in range (1, len(number_list)+1):
if i not in number_list:
return i
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.05068870005197823 seconds
import timeit
def findSmallestIntNotInList(number_list):
# Loop with a range to max(number_list) (by silgon)
# https://stackoverflow.com/a/49649558/3357935
for i in range (1, max(number_list)):
if i not in number_list:
return i
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.06317249999847263 seconds
import timeit
from itertools import count, filterfalse
def findSmallestIntNotInList(number_list):
# iterate the first number not in set (by Antti Haapala -- Слава Україні)
# https://stackoverflow.com/a/28178803/3357935
return(next(filterfalse(set(number_list).__contains__, count(1))))
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.06515420007053763 seconds
import timeit
def findSmallestIntNotInList(number_list):
# Use property of sets (by Bhargav Rao)
# https://stackoverflow.com/a/28176962/3357935
m = range(1, len(number_list))
return min(set(m)-set(number_list))
t = timeit.Timer(lambda: findSmallestIntNotInList([1,14,2,5,3,7,8,12]))
print('Execution time:', t.timeit(100000), 'seconds')
# Execution time: 0.08586219989228994 seconds
The easiest way would be just to loop through the sorted list and check if the index is equal the value and if not return the index as solution.
This would have complexity O(nlogn) because of the sorting:
for index,value in enumerate(sorted(myList)):
if index is not value:
print(index)
break
Another option is to use python sets which are somewhat dictionaries without values, just keys. In dictionaries you can look for a key in constant time which make the whol solution look like the following, having only linear complexity O(n):
mySet = set(myList)
for i in range(len(mySet)):
if i not in mySet:
print(i)
break
Edit:
If the solution should also deal with lists where no number is missing (e.g. [0,1]) and output the next following number and should also correctly consider 0, then a complete solution would be:
def find_smallest_positive_number_not_in_list(myList):
mySet = set(myList)
for i in range(1, max(mySet)+2):
if i not in mySet:
return i
A solution that returns all those values is
free_values = set(range(1, max(L))) - set(L)
it does a full scan, but those loops are implemented in C and unless the list or its maximum value are huge this will be a win over more sophisticated algorithms performing the looping in Python.
Note that if this search is needed to implement "reuse" of IDs then keeping a free list around and maintaining it up-to-date (i.e. adding numbers to it when deleting entries and picking from it when reusing entries) is a often a good idea.
The following solution loops all numbers in between 1 and the length of the input list and breaks the loop whenever a number is not found inside it. Otherwise the result is the length of the list plus one.
listOfNumbers=[1,14,2,5,3,7,8,12]
for i in range(1, len(listOfNumbers)+1):
if not i in listOfNumbers:
nextNumber=i
break
else:
nextNumber=len(listOfNumbers)+1
Related
I was trying an online test. the test asked to write a function that given a list of up to 100000 integers whose range is 1 to 100000, would find the first missing integer.
for example, if the list is [1,4,5,2] the output should be 3.
I iterated over the list as follow
def find_missing(num)
for i in range(1, 100001):
if i not in num:
return i
the feedback I receives is the code is not efficient in handling big lists.
I am quite new and I couldnot find an answer, how can I iterate more efficiently?
The first improvement would be to make yours linear by using a set for the repeated membership test:
def find_missing(nums)
s = set(nums)
for i in range(1, 100001):
if i not in s:
return i
Given how C-optimized python sorting is, you could also do sth like:
def find_missing(nums)
s = sorted(set(nums))
return next(i for i, n in enumerate(s, 1) if i != n)
But both of these are fairly space inefficient as they create a new collection. You can avoid that with an in-place sort:
from itertools import groupby
def find_missing(nums):
nums.sort() # in-place
return next(i for i, (k, _) in enumerate(groupby(nums), 1) if i != k)
For any range of numbers, the sum is given by Gauss's formula:
# sum of all numbers up to and including nums[-1] minus
# sum of all numbers up to but not including nums[-1]
expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
If a number is missing, the actual sum will be
actual = sum(nums)
The difference is the missing number:
result = expected - actual
This compulation is O(n), which is as efficient as you can get. expected is an O(1) computation, while actual has to actually add up the elements.
A somewhat slower but similar complexity approach would be to step along the sequence in lockstep with either a range or itertools.count:
for a, e in zip(nums, range(nums[0], len(nums) + nums[0])):
if a != e:
return e # or break if not in a function
Notice the difference between a single comparison a != e, vs a linear containment check like e in nums, which has to iterate on average through half of nums to get the answer.
You can use Counter to count every occurrence of your list. The minimum number with occurrence 0 will be your output. For example:
from collections import Counter
def find_missing():
count = Counter(your_list)
keys = count.keys() #list of every element in increasing order
main_list = list(range(1:100000)) #the list of values from 1 to 100k
missing_numbers = list(set(main_list) - set(keys))
your_output = min(missing_numbers)
return your_output
Given a list of integers, I want to find which number is the closest to a number I give in input:
>>> myList = [4, 1, 88, 44, 3]
>>> myNumber = 5
>>> takeClosest(myList, myNumber)
...
4
Is there any quick way to do this?
If we are not sure that the list is sorted, we could use the built-in min() function, to find the element which has the minimum distance from the specified number.
>>> min(myList, key=lambda x:abs(x-myNumber))
4
Note that it also works with dicts with int keys, like {1: "a", 2: "b"}. This method takes O(n) time.
If the list is already sorted, or you could pay the price of sorting the array once only, use the bisection method illustrated in #Lauritz's answer which only takes O(log n) time (note however checking if a list is already sorted is O(n) and sorting is O(n log n).)
I'll rename the function take_closest to conform with PEP8 naming conventions.
If you mean quick-to-execute as opposed to quick-to-write, min should not be your weapon of choice, except in one very narrow use case. The min solution needs to examine every number in the list and do a calculation for each number. Using bisect.bisect_left instead is almost always faster.
The "almost" comes from the fact that bisect_left requires the list to be sorted to work. Hopefully, your use case is such that you can sort the list once and then leave it alone. Even if not, as long as you don't need to sort before every time you call take_closest, the bisect module will likely come out on top. If you're in doubt, try both and look at the real-world difference.
from bisect import bisect_left
def take_closest(myList, myNumber):
"""
Assumes myList is sorted. Returns closest value to myNumber.
If two numbers are equally close, return the smallest number.
"""
pos = bisect_left(myList, myNumber)
if pos == 0:
return myList[0]
if pos == len(myList):
return myList[-1]
before = myList[pos - 1]
after = myList[pos]
if after - myNumber < myNumber - before:
return after
else:
return before
Bisect works by repeatedly halving a list and finding out which half myNumber has to be in by looking at the middle value. This means it has a running time of O(log n) as opposed to the O(n) running time of the highest voted answer. If we compare the two methods and supply both with a sorted myList, these are the results:
$ python -m timeit -s "
from closest import take_closest
from random import randint
a = range(-1000, 1000, 10)" "take_closest(a, randint(-1100, 1100))"
100000 loops, best of 3: 2.22 usec per loop
$ python -m timeit -s "
from closest import with_min
from random import randint
a = range(-1000, 1000, 10)" "with_min(a, randint(-1100, 1100))"
10000 loops, best of 3: 43.9 usec per loop
So in this particular test, bisect is almost 20 times faster. For longer lists, the difference will be greater.
What if we level the playing field by removing the precondition that myList must be sorted? Let's say we sort a copy of the list every time take_closest is called, while leaving the min solution unaltered. Using the 200-item list in the above test, the bisect solution is still the fastest, though only by about 30%.
This is a strange result, considering that the sorting step is O(n log(n))! The only reason min is still losing is that the sorting is done in highly optimalized c code, while min has to plod along calling a lambda function for every item. As myList grows in size, the min solution will eventually be faster. Note that we had to stack everything in its favour for the min solution to win.
>>> takeClosest = lambda num,collection:min(collection,key=lambda x:abs(x-num))
>>> takeClosest(5,[4,1,88,44,3])
4
A lambda is a special way of writing an "anonymous" function (a function that doesn't have a name). You can assign it any name you want because a lambda is an expression.
The "long" way of writing the the above would be:
def takeClosest(num,collection):
return min(collection,key=lambda x:abs(x-num))
def closest(list, Number):
aux = []
for valor in list:
aux.append(abs(Number-valor))
return aux.index(min(aux))
This code will give you the index of the closest number of Number in the list.
The solution given by KennyTM is the best overall, but in the cases you cannot use it (like brython), this function will do the work
Iterate over the list and compare the current closest number with abs(currentNumber - myNumber):
def takeClosest(myList, myNumber):
closest = myList[0]
for i in range(1, len(myList)):
if abs(i - myNumber) < closest:
closest = i
return closest
def find_nearest(array, value):
array = np.asarray(array)
idx = (np.abs(array - value)).argmin()
return array[idx]
run it by using
price_near_to=find_nearest(df['Close'], df['Close'][-2])
It's important to note that Lauritz's suggestion idea of using bisect does not actually find the closest value in MyList to MyNumber. Instead, bisect finds the next value in order after MyNumber in MyList. So in OP's case you'd actually get the position of 44 returned instead of the position of 4.
>>> myList = [1, 3, 4, 44, 88]
>>> myNumber = 5
>>> pos = (bisect_left(myList, myNumber))
>>> myList[pos]
...
44
To get the value that's closest to 5 you could try converting the list to an array and using argmin from numpy like so.
>>> import numpy as np
>>> myNumber = 5
>>> myList = [1, 3, 4, 44, 88]
>>> myArray = np.array(myList)
>>> pos = (np.abs(myArray-myNumber)).argmin()
>>> myArray[pos]
...
4
I don't know how fast this would be though, my guess would be "not very".
Expanding upon Gustavo Lima's answer. The same thing can be done without creating an entirely new list. The values in the list can be replaced with the differentials as the FOR loop progresses.
def f_ClosestVal(v_List, v_Number):
"""Takes an unsorted LIST of INTs and RETURNS INDEX of value closest to an INT"""
for _index, i in enumerate(v_List):
v_List[_index] = abs(v_Number - i)
return v_List.index(min(v_List))
myList = [1, 88, 44, 4, 4, -2, 3]
v_Num = 5
print(f_ClosestVal(myList, v_Num)) ## Gives "3," the index of the first "4" in the list.
If I may add to #Lauritz's answer
In order not to have a run error
don't forget to add a condition before the bisect_left line:
if (myNumber > myList[-1] or myNumber < myList[0]):
return False
so the full code will look like:
from bisect import bisect_left
def takeClosest(myList, myNumber):
"""
Assumes myList is sorted. Returns closest value to myNumber.
If two numbers are equally close, return the smallest number.
If number is outside of min or max return False
"""
if (myNumber > myList[-1] or myNumber < myList[0]):
return False
pos = bisect_left(myList, myNumber)
if pos == 0:
return myList[0]
if pos == len(myList):
return myList[-1]
before = myList[pos - 1]
after = myList[pos]
if after - myNumber < myNumber - before:
return after
else:
return before
def takeClosest(myList, myNumber):
newlst = []
for i in myList:
newlst.append(i - myNumber)
lstt = [abs(ele) for ele in newlst]
print(myList[lstt.index(min(lstt))])
myList = [4, 1, 88, 44, 3]
myNumber = 5
takeClosest(myList,myNumber)
Given a set of integers 1,2, and 3, find the number of ways that these can add up to n. (The order matters, i.e. say n is 5. 1+2+1+1 and 2+1+1+1 are two distinct solutions)
My solution involves splitting n into a list of 1s so if n = 5, A = [1,1,1,1,1]. And I will generate more sublists recursively from each list by adding adjacent numbers. So A will generate 4 more lists: [2,1,1,1], [1,2,1,1], [1,1,2,1],[1,1,1,2], and each of these lists will generate further sublists until it reaches a terminating case like [3,2] or [2,3]
Here is my proposed solution (in Python)
ways = []
def check_terminating(A,n):
# check for terminating case
for i in range(len(A)-1):
if A[i] + A[i+1] <= 3:
return False # means still can compute
return True
def count_ways(n,A=[]):
if A in ways:
# check if alr computed if yes then don't compute
return True
if A not in ways: # check for duplicates
ways.append(A) # global ways
if check_terminating(A,n):
return True # end of the tree
for i in range(len(A)-1):
# for each index i,
# combine with the next element and form a new list
total = A[i] + A[i+1]
print(total)
if total <= 3:
# form new list and compute
newA = A[:i] + [total] + A[i+2:]
count_ways(A,newA)
# recursive call
# main
n = 5
A = [1 for _ in range(n)]
count_ways(5,A)
print("No. of ways for n = {} is {}".format(n,len(ways)))
May I know if I'm on the right track, and if so, is there any way to make this code more efficient?
Please note that this is not a coin change problem. In coin change, order of occurrence is not important. In my problem, 1+2+1+1 is different from 1+1+1+2 but in coin change, both are same. Please don't post coin change solutions for this answer.
Edit: My code is working but I would like to know if there are better solutions. Thank you for all your help :)
The recurrence relation is F(n+3)=F(n+2)+F(n+1)+F(n) with F(0)=1, F(-1)=F(-2)=0. These are the tribonacci numbers (a variant of the Fibonacci numbers):
It's possible to write an easy O(n) solution:
def count_ways(n):
a, b, c = 1, 0, 0
for _ in xrange(n):
a, b, c = a+b+c, a, b
return a
It's harder, but possible to compute the result in relatively few arithmetic operations:
def count_ways(n):
A = 3**(n+3)
P = A**3-A**2-A-1
return pow(A, n+3, P) % A
for i in xrange(20):
print i, count_ways(i)
The idea that you describe sounds right. It is easy to write a recursive function that produces the correct answer..slowly.
You can then make it faster by memoizing the answer. Just keep a dictionary of answers that you've already calculated. In your recursive function look at whether you have a precalculated answer. If so, return it. If not, calculate it, save that answer in the dictionary, then return the answer.
That version should run quickly.
An O(n) method is possible:
def countways(n):
A=[1,1,2]
while len(A)<=n:
A.append(A[-1]+A[-2]+A[-3])
return A[n]
The idea is that we can work out how many ways of making a sequence with n by considering each choice (1,2,3) for the last partition size.
e.g. to count choices for (1,1,1,1) consider:
choices for (1,1,1) followed by a 1
choices for (1,1) followed by a 2
choices for (1) followed by a 3
If you need the results (instead of just the count) you can adapt this approach as follows:
cache = {}
def countwaysb(n):
if n < 0:
return []
if n == 0:
return [[]]
if n in cache:
return cache[n]
A = []
for last in range(1,4):
for B in countwaysb(n-last):
A.append(B+[last])
cache[n] = A
return A
I have a string, I need to find out palindromic sub-string of length 4( all 4 indexes sub-strings), in which the indexes should be in ascending order (index1<index2<index3<index4).
My code is working fine for small string like mystr. But when it comes to large string it takes long time.
from itertools import permutations
#Mystr
mystr = "kkkkkkz" #"ghhggh"
#Another Mystr
#mystr = "kkkkkkzsdfsfdkjdbdsjfjsadyusagdsadnkasdmkofhduyhfbdhfnsklfsjdhbshjvncjkmkslfhisduhfsdkadkaopiuqegyegrebkjenlendelufhdysgfdjlkajuadgfyadbldjudigducbdj"
l = len(mystr)
mylist = permutations(range(l), 4)
cnt = 0
for i in filter(lambda i: i[0] < i[1] < i[2] < i[3] and (mystr[i[0]] + mystr[i[1]] + mystr[i[2]] + mystr[i[3]] == mystr[i[3]] + mystr[i[2]] + mystr[i[1]] + mystr[i[0]]), mylist):
#print(i)
cnt += 1
print(cnt) # Number of palindromes found
If you want to stick with the basic structure of your current algorithm, a few ways to speed it up would be to use combinations instead of the permutations, which will return an iterable in sorted order. This means you don't need to check that the indexes are in ascending order. Secondly you can speed up the bit that checks for a palindrome by simply checking to see if the first two characters are identical to the last two characters reversed (instead of comparing the whole thing against its reversed self).
from itertools import combinations
mystr = "kkkkkkzsdfsfdkjdbdsjfjsadyusagdsadnkasdmkofhduyhfbdhfnsklfsjdhbshjvncjkmkslfhisduhfsdkadkaopiuqegyegrebkjenlendelufhdysgfdjlkajuadgfyadbldjudigducbdj"
cnt = 0
for m in combinations(mystr, 4):
if m[:2] == m[:1:-1]: cnt += 1
print cnt
Or if you want to simplify that last bit to a one-liner:
print len([m for m in combinations(mystr, 4) if m[:2] == m[:1:-1]])
I didn't do a real time test on this but on my system this method takes about 6.3 seconds to run (with your really long string) which is significantly faster than your method.
Is there any faster way to calculate this value in Python:
len([x for x in my_list if x in other_list])
I tried to use sets, since the lists' elements are unique, but I noticed no difference.
len(set(my_list).intersection(set(other_list)))
I'm working with big lists, so even the slightest improvement counts.
Thanks
Simple way is to find the least length'd list... than use that with set.intersection..., eg:
a = range(100)
b = range(50)
fst, snd = (a, b) if len(a) < len(b) else (b, a)
len(set(fst).intersection(snd))
I think a generator expression like so would be fast
sum(1 for i in my_list if i in other_list)
Otherwise a set intersection is about as fast as it will get
len(set(my_list).intersection(other_list))
From https://wiki.python.org/moin/TimeComplexity, set intersection for two sets s and t has time complexity:
Average - O(min(len(s), len(t))
Worst - O(len(s) * len(t))
len([x for x in my_list if x in other_list]) has complexity O(n^2) which is equivalent to the worst case for set.intersection().
If you use set.intersection() you only need to convert one of the lists to a set first:
So len(set(my_list).intersection(other_list)) should on average going to be faster than the nested list comprehension.
You could try using the filter function. Since you mentioned you're working with huge lists, ifilterof itertools module would be a good option:
from itertools import ifilter
my_set = set(range(100))
other_set = set(range(50))
for item in ifilter(lambda x: x in other_set, my_set):
print item
The idea is to sort the two lists first and then traverse them like we want to merge them, in order to find the elements in first list belonging also to second list. This way we have an O(n logn) algorithm.
def mycount(l, m):
l.sort()
m.sort()
i, j, counter = 0, 0, 0
while i < len(l) and j < len(m):
if l[i] == m[j]:
counter += 1
i += 1
elif l[i] < m[j]:
i += 1
else:
j += 1
return counter
From local tests it's 100 times faster than len([x for x in a if x in b]) when working with lists of 10000 elements.
EDIT:
Considering that the list elements are unique, the common elements will have a frequency two in the union of the two lists. Also they will be together when we sort this union. So the following is also valid:
def mycount(l, m):
s = sorted(l + m)
return sum(s[i] == s[i + 1] for i in xrange(len(s) - 1))
Similarily, we can use a counter:
from collections import Counter
def mycount(l, m):
c = Counter(l)
c.update(m)
return sum(v == 2 for v in c.itervalues())