I have been following Neetcode and working on several Leetcode problems. Here on 347 he advises that his solution is O(n), but I am having a hard time breaking down the solution to determine why that is. I feel like it is because the nested for loop only runs until len(answer) == k.
I started off by getting the time complexity of each individual line, and the first several are O(n) or O(1), which makes sense. Once I got to the nested for loop I was thrown off, because it would make sense that the inner loop runs for each outer loop iteration and results in O(n*m), or something of that nature. This is where I think the limiting factor of the return condition comes in to act as a ceiling on the loop iterations, since we always return once the answer array's length equals k (also, the problem guarantees a unique answer). What is throwing me off the most is that I default to thinking that any nested for loop is going to be O(n^2), which is often the case, but apparently not every time.
Could someone please advise if I am on the right track with my suggestion? How would you break this down?
class Solution:
    def topKFrequent(self, nums: List[int], k: int) -> List[int]:
        countDict = {}
        frequency = [[] for i in range(len(nums) + 1)]
        for j in nums:
            countDict[j] = 1 + countDict.get(j, 0)
        for c, v in countDict.items():
            frequency[v].append(c)

        answer = []
        for n in range(len(frequency) - 1, 0, -1):
            for q in frequency[n]:
                print(frequency[n])
                answer.append(q)
                if len(answer) == k:
                    return answer
frequency is a mapping between frequency-of-elements and the values in the original list. The total number of elements inside frequency is always less than or equal to the number of items in nums (because it maps only the unique values in nums).
Even though there is a nested loop, across all iterations of the outer loop it only ever visits O(C*n) total values, where C <= 1, which is equal to O(n).
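To make that concrete, here is a small worked example (the input is illustrative, not from the original question):
nums = [1, 1, 1, 2, 2, 3]
# countDict ends up as {1: 3, 2: 2, 3: 1}
# frequency has len(nums) + 1 = 7 buckets, indexed by count:
#   [[], [3], [2], [1], [], [], []]
# The nested loop walks the buckets from the highest count down and touches
# each unique value at most once, so the total work across both loops is
# bounded by the number of unique values (<= n), not n * n.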
Note, you could clean up this answer fairly easily with some basic helpers. Counter can get you the original mapping for countDict. You can use a defaultdict to construct frequency. Then you can flatten the frequency dict values and slice the final result.
from collections import Counter, defaultdict

class Solution:
    def top_k_frequent(self, nums: list[int], k: int) -> list[int]:
        counter = Counter(nums)
        frequency = defaultdict(list)
        for num, freq in counter.items():
            frequency[freq].append(num)

        sorted_nums = []
        for freq in sorted(frequency, reverse=True):
            sorted_nums += frequency[freq]
        return sorted_nums[:k]
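For example, an illustrative call (not part of the original answer):
# counter: {1: 3, 2: 2, 3: 1} -> frequency: {3: [1], 2: [2], 1: [3]}
print(Solution().top_k_frequent([1, 1, 1, 2, 2, 3], 2))  # [1, 2]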
A fun way to do this as a one-liner!
from collections import Counter

class Solution:
    def top_k_frequent(self, nums: list[int], k: int) -> list[int]:
        return sorted(set(nums), key=Counter(nums).get, reverse=True)[:k]
The task is to return the K most frequent elements.
What I did was calculate the frequencies and put them in a min-heap (as we know there's no max-heap in Python), then heappop k times.
from collections import defaultdict
import heapq

class Solution:
    def topKFrequent(self, nums: List[int], k: int) -> List[int]:
        counts = defaultdict(int)
        for i in range(len(nums)):
            counts[nums[i]] -= 1

        counts = list(counts.items())
        heapq.heapify(counts)

        top_k = [heapq.heappop(counts)[0] for i in range(k)]
        return top_k
Why does my code fail on topKFrequent([4,1,-1,2,-1,2,3], 2)?
You asked why it fails. It's failing because heappop returns the smallest item from counts. Each item in counts is a (num, count) tuple, and tuples compare by their first element, so the heap is ordered by the nums themselves rather than by their counts. You are therefore getting back the smallest nums, just deduplicated, because the dictionary essentially acts like a set.
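A minimal way to make that approach work, keeping your negative-count trick to simulate a max-heap, is to put the count first in each tuple so the heap orders by frequency. This is a sketch, not your original code:
from collections import defaultdict
from typing import List
import heapq

class Solution:
    def topKFrequent(self, nums: List[int], k: int) -> List[int]:
        counts = defaultdict(int)
        for num in nums:
            counts[num] -= 1  # negative counts: the most frequent value becomes the smallest

        # (count, num) tuples so the heap compares counts, not the numbers themselves
        heap = [(count, num) for num, count in counts.items()]
        heapq.heapify(heap)

        return [heapq.heappop(heap)[1] for _ in range(k)]

With this change, topKFrequent([4,1,-1,2,-1,2,3], 2) returns [-1, 2].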
I'm working on https://leetcode.com/problems/permutations/ and I'm trying to decide which approach for generating the permutations is more clear. The question is "Given an array nums of distinct integers, return all the possible permutations. You can return the answer in any order." I've got two different solutions below.
Solution 1
def permute(self, nums: List[int]) -> List[List[int]]:
    results = []
    N = len(nums)

    def dfs(subset, permutation: List[int]):
        if len(subset) == N:
            results.append(subset.copy())
            return
        for i, num in enumerate(permutation):
            subset.append(num)
            dfs(subset, permutation[:i] + permutation[i+1:])
            # backtracking
            subset.pop()

    dfs([], nums)
    return results
Solution 2
def permute(self, nums: List[int]) -> List[List[int]]:
    results = []
    N = len(nums)

    def dfs(subset, permutation: List[int]):
        if len(subset) == N:
            results.append(subset.copy())
            return
        for i, num in enumerate(permutation):
            dfs(subset + [num], permutation[:i] + permutation[i+1:])

    dfs([], nums)
    return results
I believe that in the first solution, when you append to a list in Python (i.e. append to the subset parameter), lists are passed by reference, so each recursive call shares the same list. This is why we have to explicitly backtrack by popping from subset. However, in the second solution, when a list is passed to a recursive call with the syntax subset + [num], a copy of the list is passed to each recursive call, so that's why we don't have to explicitly backtrack.
Can someone confirm if my assumptions are correct? Is one approach favored over another? I think the time and space complexities are identical for both approaches (O(N!) and O(N), respectively) where N = the number of elements in nums.
Yes, you are right that the first permute passes the same object (subset) to each recursive call.
This is possible in the first permute because lists are mutable; if you had a string to permute instead, you would have to pass a copy, because strings are immutable.
In the second permute, a new copy of subset is created for each call. You can test this by putting print(id(subset)) at the beginning of dfs in each permute: it prints the same id every time in the first permute, but not in the second.
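As a quick standalone illustration of that id check (hypothetical helper functions, not the original code):
def visit_shared(subset, depth=2):
    print("shared:", id(subset))   # same id every time: one list is mutated in place
    if depth:
        subset.append(depth)
        visit_shared(subset, depth - 1)
        subset.pop()               # explicit backtracking, as in the first permute

def visit_copied(subset, depth=2):
    print("copied:", id(subset))   # a new object on each nested call, so no pop is needed
    if depth:
        visit_copied(subset + [depth], depth - 1)

visit_shared([])
visit_copied([])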
Even though both have the same time complexity (it depends on what you do at the base condition: it's O(N·N!) rather than O(N!), because you append a copy of the list to the result list), why create a copy of subset and place an entirely new object on the stack when you can just put a reference to the existing object (not the object itself!) on the stack, which consumes less memory? So I prefer the first permute.
I was trying an online test. The test asked me to write a function that, given a list of up to 100000 integers in the range 1 to 100000, would find the first missing integer.
for example, if the list is [1,4,5,2] the output should be 3.
I iterated over the list as follows:
def find_missing(num):
    for i in range(1, 100001):
        if i not in num:
            return i
The feedback I received is that the code is not efficient at handling big lists.
I am quite new and could not find an answer: how can I iterate more efficiently?
The first improvement would be to make yours linear by using a set for the repeated membership test:
def find_missing(nums):
    s = set(nums)
    for i in range(1, 100001):
        if i not in s:
            return i
Given how C-optimized Python's sorting is, you could also do something like:
def find_missing(nums):
    s = sorted(set(nums))
    return next(i for i, n in enumerate(s, 1) if i != n)
But both of these are fairly space inefficient as they create a new collection. You can avoid that with an in-place sort:
from itertools import groupby

def find_missing(nums):
    nums.sort()  # in-place
    return next(i for i, (k, _) in enumerate(groupby(nums), 1) if i != k)
For any range of consecutive numbers, the sum is given by Gauss's formula (this assumes nums is sorted, so that nums[0] and nums[-1] are the smallest and largest values):
# sum of all numbers up to and including nums[-1] minus
# sum of all numbers up to but not including nums[0]
expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
If a number is missing, the actual sum will be
actual = sum(nums)
The difference is the missing number:
result = expected - actual
This computation is O(n), which is as efficient as you can get. expected is an O(1) computation, while actual has to actually add up the elements.
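Putting those pieces together, a minimal sketch, assuming nums is sorted and exactly one number is missing from its range:
def find_missing(nums):
    # Gauss's formula for the sum of the full consecutive range nums[0]..nums[-1]
    expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
    actual = sum(nums)           # O(n) pass over the data
    return expected - actual     # the single missing value

print(find_missing([1, 2, 4, 5]))  # 3 (the example input, sorted first)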
A somewhat slower but similar complexity approach would be to step along the sequence in lockstep with either a range or itertools.count:
for a, e in zip(nums, range(nums[0], len(nums) + nums[0])):
    if a != e:
        return e  # or break if not in a function
Notice the difference between a single comparison a != e, vs a linear containment check like e in nums, which has to iterate on average through half of nums to get the answer.
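Wrapped into a complete function, again as a sketch that assumes nums is sorted and has a single gap:
from itertools import count

def find_missing(nums):
    # walk the sorted list next to the expected consecutive values
    for actual, expected in zip(nums, count(nums[0])):
        if actual != expected:
            return expected
    return None  # no gap inside the list

print(find_missing([1, 2, 4, 5]))  # 3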
You can use Counter to count every occurrence in your list. The minimum number with an occurrence count of 0 will be your output. For example:
from collections import Counter

def find_missing(your_list):
    count = Counter(your_list)
    keys = count.keys()  # the distinct values that are present
    main_list = list(range(1, 100001))  # the values from 1 to 100000
    missing_numbers = list(set(main_list) - set(keys))
    your_output = min(missing_numbers)
    return your_output
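For the example input this would give:
print(find_missing([1, 4, 5, 2]))  # 3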
Given a list of lists in which all the sublists are sorted, e.g.:
[[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]]
what is the fastest way to get the union of these lists without having duplicates in it and also get an ordered result?
So the resulting list should be:
[1,2,3,4,5,6,7,20,25,26,31].
I know I could just union them all without duplicates and then sort them but are there faster ways (like: do the sorting while doing the union) built into python?
EDIT:
Is the proposed answer faster than executing the following algorithm pairwise with all the sublists?
You can use heapq.merge for this:
def mymerge(v):
    from heapq import merge
    last = None
    for a in merge(*v):
        if a != last:  # remove duplicates
            last = a
            yield a
print(list(mymerge([[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]])))
# [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
(EDITED)
The asymptotically best approach to the problem is to use a priority queue, like, for example, the one used by heapq.merge() (thanks to #kaya3 for pointing this out).
However, in practice, a number of things can go wrong. For example, the constant factors in the complexity analysis can be large enough that a theoretically optimal approach is slower in real-life scenarios.
This is fundamentally dependent on the implementation.
For example, Python suffers a speed penalty for explicit looping.
So, let's consider a couple of approaches and how they perform for some concrete inputs.
Approaches
Just to give you some idea of the numbers we are discussing, here are a few approaches:
merge_sorted(), which uses the most naive approach: flatten the sequences, reduce them to a set() (removing duplicates), and sort as required.
import itertools

def merge_sorted(seqs):
    return sorted(set(itertools.chain.from_iterable(seqs)))
merge_heapq(), which is essentially #arshajii's answer. Note that an itertools.groupby() variation is slightly (less than ~1%) faster.
import heapq

def i_merge_heapq(seqs):
    last_item = None
    for item in heapq.merge(*seqs):
        if item != last_item:
            yield item
            last_item = item

def merge_heapq(seqs):
    return list(i_merge_heapq(seqs))
merge_bisect_set() is substantially the same algorithm as merge_sorted(), except that the result is now constructed explicitly using the bisect module for sorted insertions. Since this does fundamentally the same work as sorted() but with the looping done in Python, it is not going to be faster.
import itertools
import bisect

def merge_bisect_set(seqs):
    result = []
    for item in set(itertools.chain.from_iterable(seqs)):
        bisect.insort(result, item)
    return result
merge_bisect_cond() is similar to merge_bisect_set(), but now the non-repeating constraint is enforced explicitly by checking the final list. However, this is much more expensive than just using set() (in fact, it is so slow that it was excluded from the plots).
def merge_bisect_cond(seqs):
    result = []
    for item in itertools.chain.from_iterable(seqs):
        if item not in result:
            bisect.insort(result, item)
    return result
merge_pairwise() explicitly implements the theoretically efficient algorithm, similar to what you outlined in your question.
def join_sorted(seq1, seq2):
    result = []
    i = j = 0
    len1, len2 = len(seq1), len(seq2)
    while i < len1 and j < len2:
        if seq1[i] < seq2[j]:
            result.append(seq1[i])
            i += 1
        elif seq1[i] > seq2[j]:
            result.append(seq2[j])
            j += 1
        else:  # seq1[i] == seq2[j]
            result.append(seq1[i])
            i += 1
            j += 1
    if i < len1:
        result.extend(seq1[i:])
    elif j < len2:
        result.extend(seq2[j:])
    return result

def merge_pairwise(seqs):
    result = []
    for seq in seqs:
        result = join_sorted(result, seq)
    return result
merge_loop() implements a generalization of the above, where a single pass is made over all sequences at once, instead of merging them pairwise.
def merge_loop(seqs):
    result = []
    lengths = list(map(len, seqs))
    idxs = [0] * len(seqs)
    while any(idx < length for idx, length in zip(idxs, lengths)):
        item = min(
            seq[idx]
            for idx, seq, length in zip(idxs, seqs, lengths) if idx < length)
        result.append(item)
        for i, (idx, seq, length) in enumerate(zip(idxs, seqs, lengths)):
            if idx < length and seq[idx] == item:
                idxs[i] += 1
    return result
Benchmarks
By generating the input using:
import random

def gen_input(n, m=100, a=None, b=None):
    if a is None and b is None:
        b = 2 * n * m
        a = -b
    return tuple(
        tuple(sorted(set(random.randint(int(a), int(b)) for _ in range(n))))
        for __ in range(m))
one can plot the timings for varying n.
Note that, in general, the performance will vary for different values of n (the size of each sequence) and m (the number of sequences), but also for a and b (the minimum and maximum numbers generated).
For brevity, this was not explored in this answer, but feel free to play around with it here, which also includes some other implementations, notably some tentative speed-ups with Cython that were only partially successful.
You can make use of sets in Python 3.
mylist = [[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]]
mynewlist = mylist[0] + mylist[1] + mylist[2]
print(sorted(set(mynewlist)))
Output:
[1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
First merge all the sub-lists using list addition.
Then convert it into a set object, which deletes all the duplicates.
Finally, sorted() turns the set back into a list in ascending order, which gives your desired output.
Hope it answers your question.
I have the following code for Leetcode's Top K Frequent Elements question.
The required time complexity is better than O(n log n), where n is the array's size.
Isn't my big-O complexity O(n)?
If so, why am I still exceeding the time limit?
def topKFrequent(self, nums, k):
    output = {}
    outlist = []
    for item in nums:
        output[item] = nums.count(item)
    max_count = sorted(output.values(), reverse=True)[:k]
    for key, val in output.items():
        if val in max_count:
            outlist.append(key)
    return outlist
Test input: [1,1,1,2,2,3,1,1,1,2,2,3], k = 2
Test output: [1,2]
Question link: https://leetcode.com/problems/top-k-frequent-elements/
Your solution is O(n^2), because of this:
for item in nums:
    output[item] = nums.count(item)
For each item in your array, you're looking through the whole array to count the number of elements which are the same.
Instead of doing this, you can get the counts in O(n) by iterating nums and adding 1 to the counter of each item you find as you go.
The O(n log n) in the end will come from sorted(output.values(), reverse=True), because any generic comparison-based sorting algorithm (including Timsort) is O(n log n).
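A minimal sketch of that counting fix, keeping the rest of your approach unchanged (dict.get is one common way to do it; collections.Counter is another):
def topKFrequent(self, nums, k):
    output = {}
    for item in nums:                          # one pass: O(n) counting
        output[item] = output.get(item, 0) + 1
    max_count = sorted(output.values(), reverse=True)[:k]
    return [key for key, val in output.items() if val in max_count]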
As another answer mentions, your counting is O(n^2) time complexity, which is what causes your time limit to be exceeded. Fortunately, Python comes with a Counter class in the collections module, which does exactly what the other answer describes, but in well-optimized C code. This reduces your time complexity to O(n log n).
Furthermore, you can reduce the time complexity to O(n log k) by replacing the sort with a min-heap trick: keep a min-heap of size k, then push each remaining element and pop the minimum, one at a time, until every element has passed through the heap. The k elements that remain in the heap are your k most frequent values.
from collections import Counter
from heapq import heappushpop, heapify

def get_most_frequent(nums, k):
    counts = Counter(nums)
    # (count, value) pairs; the k inside the comprehension has its own scope,
    # so it does not clobber the function parameter k
    counts = [(v, k) for k, v in counts.items()]
    heap = counts[:k]
    heapify(heap)
    for count in counts[k:]:
        heappushpop(heap, count)
    return [k for v, k in heap]
If you must return the elements in a particular order, you can sort the k elements in O(k log k) time, which still results in the same O(n log k) time complexity overall.
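For instance, an illustrative call (the order of the returned elements depends on how they sit in the heap, not on frequency):
print(get_most_frequent([1, 1, 1, 2, 2, 3, 1, 1, 1, 2, 2, 3], 2))
# e.g. [2, 1] -- the two most frequent values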