Is iterating through a set faster than through a list? - python

I'm doing the longest consecutive sequence problem on LeetCode (https://leetcode.com/problems/longest-consecutive-sequence/) and wrote the following solution:
(I made a typo earlier and put s instead of nums on line 6)
class Solution:
    def longestConsecutive(self, nums: List[int]) -> int:
        s = set(nums)
        res = 0
        for n in nums:
            if n - 1 not in nums:
                c = 1
                while n + 1 in s:
                    c += 1
                    n += 1
                res = max(res, c)
        return res
This solution takes 4902 ms according to the website, but when I change the first for loop to
for n in s:
The runtime drops to 491 ms. Is looping through the hashset 10 times faster?

If you change if n - 1 not in nums to if n - 1 not in s, you should see the runtime drop considerably. The in operator is much faster on a set than on a list: for a set, in takes O(1) on average, while for a list it takes O(n). See https://wiki.python.org/moin/TimeComplexity
Regarding iterating over a set versus a list: iterating through the set can be faster if the list contains many duplicates. For example, iterating through a list of n identical elements takes O(n), while iterating through the corresponding set takes O(1), because the set contains only one element.
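For reference, a minimal sketch that combines both changes (iterating over s and using s for the membership check); everything else is as in the question:
from typing import List

class Solution:
    def longestConsecutive(self, nums: List[int]) -> int:
        s = set(nums)
        res = 0
        for n in s:                  # iterate over the deduplicated set
            if n - 1 not in s:       # O(1) average membership check instead of O(n)
                c = 1
                while n + 1 in s:
                    c += 1
                    n += 1
                res = max(res, c)
        return res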

Is iterating through a set faster than through a list?
No, iterating through either of these data structures takes the same time for the same number of elements.
However, while n + 1 in s does not iterate through the elements of s. Here in is an operator that checks whether the value n + 1 is an element of s. If s is a set, this check takes O(1) time on average. If s is a list, it takes O(n) time.

Related

Leetcode 46: Choosing between two different ways to generate permutations?

I'm working on https://leetcode.com/problems/permutations/ and I'm trying to decide which approach for generating the permutations is more clear. The question is "Given an array nums of distinct integers, return all the possible permutations. You can return the answer in any order." I've got two different solutions below.
Solution 1
def permute(self, nums: List[int]) -> List[List[int]]:
    results = []
    N = len(nums)
    def dfs(subset, permutation: List[int]):
        if len(subset) == N:
            results.append(subset.copy())
            return
        for i, num in enumerate(permutation):
            subset.append(num)
            dfs(subset, permutation[:i] + permutation[i+1:])
            # backtracking
            subset.pop()
    dfs([], nums)
    return results
Solution 2
def permute(self, nums: List[int]) -> List[List[int]]:
    results = []
    N = len(nums)
    def dfs(subset, permutation: List[int]):
        if len(subset) == N:
            results.append(subset.copy())
            return
        for i, num in enumerate(permutation):
            dfs(subset + [num], permutation[:i] + permutation[i+1:])
    dfs([], nums)
    return results
I believe that in the first solution, when you append to a list in Python (i.e., append to the subset parameter), the list is passed by reference, so each recursive call shares the same list. This is why we have to explicitly backtrack by popping from subset. In the second solution, however, when a list is passed to a recursive call with the syntax subset + [num], a new copy is passed to each recursive call, which is why we don't have to backtrack explicitly.
Can someone confirm whether my assumptions are correct? Is one approach favored over the other? I think the time and space complexities are identical for both approaches (O(N!) and O(N), respectively), where N is the number of elements in nums.
Yes, you are right that the first permute passes the same object (subset) into each recursive call.
This is possible in the first permute because lists are mutable; if you were permuting a string, you would have to pass a copy, because strings are immutable.
In the second permute, a copy of subset is created on every call. You can test this with the statement print(id(subset)) at the beginning of dfs in each permute: the first permute prints the same id every time, while the second does not.
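A minimal, self-contained illustration of that id() check (a toy dfs, not the code from the question):
def dfs_shared(subset, depth):
    print("shared:", id(subset))            # same id on every call
    if depth == 0:
        return
    subset.append(depth)
    dfs_shared(subset, depth - 1)            # passes the same list object down
    subset.pop()                             # explicit backtracking

def dfs_copy(subset, depth):
    print("copy:  ", id(subset))             # a different id on each call
    if depth == 0:
        return
    dfs_copy(subset + [depth], depth - 1)    # subset + [depth] builds a new list

dfs_shared([], 3)
dfs_copy([], 3)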
As for which to prefer: even though both have the same time complexity (it depends on what you do at the base case; it is O(N·N!) rather than O(N!), because you append a copy of the list to the result list), why create a copy of subset and push an entirely new object onto the stack when you can push a copy of the object reference (not the object itself!), which consumes less memory? So I prefer the first permute.

most efficient way to iterate over a large array looking for a missing element in Python

I was trying an online test. The test asked to write a function that, given a list of up to 100000 integers whose range is 1 to 100000, finds the first missing integer.
For example, if the list is [1,4,5,2] the output should be 3.
I iterated over the list as follows:
def find_missing(num):
    for i in range(1, 100001):
        if i not in num:
            return i
The feedback I received is that the code is not efficient at handling big lists.
I am quite new and I could not find an answer; how can I iterate more efficiently?
The first improvement would be to make yours linear by using a set for the repeated membership test:
def find_missing(nums):
    s = set(nums)
    for i in range(1, 100001):
        if i not in s:
            return i
Given how C-optimized Python's sorting is, you could also do something like:
def find_missing(nums):
    s = sorted(set(nums))
    return next(i for i, n in enumerate(s, 1) if i != n)
But both of these are fairly space inefficient as they create a new collection. You can avoid that with an in-place sort:
from itertools import groupby

def find_missing(nums):
    nums.sort()  # in-place
    return next(i for i, (k, _) in enumerate(groupby(nums), 1) if i != k)
For any range of numbers, the sum is given by Gauss's formula (this assumes nums is sorted, so nums[0] is the smallest value and nums[-1] the largest):
# sum of all numbers up to and including nums[-1] minus
# sum of all numbers up to but not including nums[0]
expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
If a number is missing, the actual sum will be
actual = sum(nums)
The difference is the missing number:
result = expected - actual
This computation is O(n), which is as efficient as you can get: expected is an O(1) computation, while actual has to actually add up the elements.
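Put together, a sketch of that approach might look like this (assuming nums is sorted and exactly one value in its range is missing):
def find_missing(nums):
    # Gauss's formula: the sum of 1..m is m * (m + 1) // 2
    expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
    actual = sum(nums)            # O(n) pass over the list
    return expected - actual      # the single missing number

print(find_missing([1, 2, 4, 5]))  # 3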
A somewhat slower but similar complexity approach would be to step along the sequence in lockstep with either a range or itertools.count:
for a, e in zip(nums, range(nums[0], len(nums) + nums[0])):
    if a != e:
        return e  # or break if not in a function
Notice the difference between a single comparison a != e, vs a linear containment check like e in nums, which has to iterate on average through half of nums to get the answer.
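Wrapped into a function (again assuming a sorted list with a single gap), that lockstep scan might look like:
from itertools import count

def find_missing(nums):
    # compare each element with the value it should have
    for a, e in zip(nums, count(nums[0])):
        if a != e:
            return e

print(find_missing([1, 2, 4, 5]))  # 3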
You can use Counter to count the occurrences of every element in your list. The minimum number with zero occurrences will be your output. For example:
from collections import Counter

def find_missing(your_list):
    count = Counter(your_list)
    keys = count.keys()  # every distinct element of the list
    main_list = list(range(1, 100001))  # the values from 1 to 100000
    missing_numbers = list(set(main_list) - set(keys))
    your_output = min(missing_numbers)
    return your_output

Python return list of duplicates in list in order

How do you quickly return a list of the duplicates in a list, in the order that they appear? For example, duplicates([2,3,5,5,5,6,6,3]) results in [5,6,3], meaning a repeated element is added to the resulting duplicates list only when its second occurrence appears. So far I have the code below, but it's not running fast enough to pass large test cases. Is there any faster option without imports?
def duplicates(L):
    first = set()
    second = []
    for i in L:
        if i in first and i not in second:
            second.append(i)
            continue
        if i not in first and i not in second:
            first.add(i)
            continue
    return second
You're doing well to use a set for first, since it has O(1) time complexity for the in operation.
On the other hand, you're using a list for second, which turns this function into O(N^2); in the worst case you scan the second list twice for each element.
So my suggestion is to use a dictionary to store the numbers you have found.
For example:
def duplicates(L):
    first = dict()
    second = []
    for i in L:
        if i not in first:    # first time the number appears
            first[i] = False
        elif not first[i]:    # number not on the second list yet
            second.append(i)
            first[i] = True
    return second
Note that I used a Boolean as the dictionary value to represent whether the number has appeared more than once (i.e., whether it was already added to the second list).
This solution has O(N) time complexity, which means it is much faster.
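A quick check against the example from the question (a usage sketch, not part of the original answer):
print(duplicates([2, 3, 5, 5, 5, 6, 6, 3]))  # [5, 6, 3]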
Revision of posted code
Use a dictionary rather than a list for second in the OP's code:
Dicts have O(1) rather than O(n) lookup times (lists are O(n))
Dicts keep track of insertion order for Python 3.6+ (or you could use OrderedDict)
Code
def duplicatesnew(L):
    first = set()
    second = {}  # change to dictionary
    for i in L:
        if i in first and i not in second:
            second[i] = None
            continue
        if i not in first and i not in second:
            first.add(i)
            continue
    return list(second.keys())  # report list of keys

lst = [2,3,5,5,5,6,6,3]
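For example, calling the revised function on that list (a usage sketch):
print(duplicatesnew(lst))  # [5, 6, 3]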
Performance
Summary
Comparable on short lists
About 2X faster on longer lists
Tests
Use lists of length N. For N = 6, use the original list; for N > 6, use:
lst = [random.randint(1, 10) for _ in range(N)]
N = 6
Original: 2.24 us
Revised: 2.74 us
N = 1,000 (random numbers between 1 and 10)
Original: 241 us
Revised: 146 us
N = 100,000
Original: 27.2 ms
Revised: 13.4 ms
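A sketch of how timings like these could be reproduced (the timeit harness below is an assumption; it is not shown in the original answer, and it relies on duplicates and duplicatesnew defined above):
import random
import timeit

def benchmark(n, repeats=10):
    # N = 6 uses the original list; larger N uses random values between 1 and 10
    lst = [2, 3, 5, 5, 5, 6, 6, 3] if n == 6 else [random.randint(1, 10) for _ in range(n)]
    for fn in (duplicates, duplicatesnew):
        total = timeit.timeit(lambda: fn(lst), number=repeats)
        print(f"N={n:>7}  {fn.__name__}: {total / repeats * 1e6:.1f} us per call")

for n in (6, 1000, 100000):
    benchmark(n)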

Longest phrase in Tweet - Python timeout

Input - array / list a, constant k
Output - length of the longest sublist/subarray with sum <= k
E.g. given the tweet "I am Bob", i.e. the word-length array [1,2,3] and k=3,
the sublists with sum <= k are [1],[2],[3],[1,2]
The longest such sublist is [1,2], so the answer is 2
Issue - TimeOut error in Python on HackerRank
Time complexity - one for loop - O(n)
Space complexity - O(n)
def maxLength(a, k):
    lenmax = 0
    dummy = []
    for i in a:
        dummy.append(i)
        if sum(dummy) <= k:
            lenmax = max(lenmax, len(dummy))
        else:
            del dummy[0]
    return lenmax
Solved it by replacing the time-intensive operation.
A time-out occurs when the code exceeds the time limit HackerRank sets for each environment ("HackerRank TimeOut").
Solution
Replace the sum() call with a running variable.
In the worst case, calling sum(dummy) on every iteration makes the whole function O(n^2), since the entire list may be summed up each time.
Maintaining a running-sum variable instead keeps the whole function at O(n), because updating the variable is O(1).
def maxLength(a, k):
    lenmax = 0
    dummy = []
    sumdummy = 0
    for i in a:
        dummy.append(i)
        sumdummy += i
        if sumdummy <= k:
            lenmax = max(lenmax, len(dummy))
        else:
            sumdummy -= dummy[0]
            del dummy[0]
    return lenmax
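A quick check against the example from the question (a usage sketch, not part of the original post):
print(maxLength([1, 2, 3], 3))  # 2, for the sublist [1, 2]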

Is the actual performance difference between these two O(n^2) algorithms coming from cache/memory access?

I wrote two methods to sort a list of numbers. They have the same time complexity, O(n^2), but their actual running times differ by a factor of about 3 (the second method takes roughly 3 times as long as the first).
My guess is that the difference comes from the memory hierarchy (the counts of register/cache/memory fetches are quite different); is that correct?
To be specific: the first method compares one list element with a variable and assigns values between them, while the second method compares two list elements and assigns values between them. I guess this means the second method has many more cache/memory fetches than the first one. Right?
When the list has 10000 elements, the loop counts and running times are as below:
# TakeSmallestRecursivelyToSort Loop Count: 50004999
# TakeSmallestRecursivelyToSort Time: 7861.999988555908 ms
# CompareOneThenMoveUntilSort Loop Count: 49995000
# CompareOneThenMoveUntilSort Time: 17115.999937057495 ms
This is the code:
# first method
def TakeSmallestRecursivelyToSort(input_list: list) -> list:
    """In-place sorting, find smallest element and swap."""
    count = 0
    for i in range(len(input_list)):
        #s_index = Find_smallest(input_list[i:])  # s_index is relative to i
        if len(input_list[i:]) == 0:
            raise ValueError
        if len(input_list[i:]) == 1:
            break
        index = 0
        smallest = input_list[i:][0]
        for e_index, j in enumerate(input_list[i:]):
            count += 1
            if j < smallest:
                index = e_index
                smallest = j
        s_index = index
        input_list[i], input_list[s_index + i] = input_list[s_index + i], input_list[i]
    print('TakeSmallestRecursivelyToSort Count', count)
    return input_list
# second method
def CompareOneThenMoveUntilSort(input_list: list) -> list:
    count = 0
    for i in range(len(input_list)):
        for j in range(len(input_list) - i - 1):
            count += 1
            if input_list[j] > input_list[j+1]:
                input_list[j], input_list[j+1] = input_list[j+1], input_list[j]
    print('CompareOneThenMoveUntilSort Count', count)
    return input_list
Your first algorithm may make O(N^2) comparisons, but it only makes O(N) swaps. It's those swaps that take the most time. If you remove the swaps from the second algorithm, you'll see that it then takes significantly less time:
def CompareOneThenMoveUntilSortNoSwap(input_list: list) -> list:
    for i in range(len(input_list)):
        for j in range(len(input_list) - i - 1):
            if input_list[j] > input_list[j+1]:
                pass
# 1000 randomised sequential integers, 100 repeats
TakeSmallestRecursivelyToSort: 4.625916245975532
CompareOneThenMoveUntilSort: 10.164166125934571
CompareOneThenMoveUntilSortNoSwap: 4.86395191506017
Just because two algorithms are in the same asymptotic order class doesn't mean they'll be just as fast. Those constant costs still count when comparing implementations of algorithms within the same order class. So while both implementations trace the same quadratic curve when you plot time taken against the number of elements sorted, the CompareOneThenMoveUntilSort implementation's line sits higher up the time-taken chart.
Note that you have increased the constant cost of each outer-loop iteration in the TakeSmallestRecursivelyToSort implementation by adding 4 additional O(N) operations in there. Each input_list[i:] slice creates a new list object, copying across all references from index i onwards to the new list. It could be faster still:
def TakeSmallestRecursivelyToSortImproved(input_list: list) -> list:
    """In-place sorting, find smallest element and swap."""
    l = len(input_list)
    for i in range(l - 1):
        index = i
        smallest = input_list[i]
        for j, value in enumerate(input_list[i + 1:], i + 1):
            if value < smallest:
                smallest, index = value, j
        input_list[i], input_list[index] = input_list[index], input_list[i]
    return input_list
This one takes about 3 seconds.
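A sketch of a harness for reproducing this kind of comparison (the timeit setup and list size here are assumptions, not the setup used in the answer; it relies on the three sort functions defined above):
import random
import timeit

def time_sort(fn, n=1000, repeats=5):
    data = list(range(n))
    random.shuffle(data)
    # each run sorts a fresh copy so earlier runs don't hand later ones pre-sorted input
    return timeit.timeit(lambda: fn(data.copy()), number=repeats)

for fn in (TakeSmallestRecursivelyToSort,
           CompareOneThenMoveUntilSort,
           TakeSmallestRecursivelyToSortImproved):
    print(fn.__name__, time_sort(fn))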
