What is the run time of the set difference function in Python?

The question explains it, but what is the time complexity of the set difference operation in Python?
EX:
A = set([...])
B = set([...])
print(A.difference(B)) # What is the time complexity of the difference function?
My intuition tells me O(n) because we can iterate through set A and for each element, see if it's contained in set B in constant time (with a hash function).
Am I right?
(Here is the answer that I came across: https://wiki.python.org/moin/TimeComplexity)

It looks like you're right: difference is performed in O(n) in the typical case.
But keep in mind that in the worst case (maximizing hash collisions) it can rise to O(n**2), since a single lookup is O(n) in the worst case (see: How is set() implemented?); in practice, though, you can generally rely on O(1) lookups.
As an aside, speed depends on the type of object in the set. Integers hash cheaply (roughly as themselves, with some modulo), whereas strings need more CPU to hash.
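As a quick illustration of that aside (my own snippet, not from the linked page): in CPython, small integers hash to themselves, while string hashes are computed from the characters and are randomized per process.
print(hash(12345))   # 12345 - small ints hash to (roughly) themselves in CPython
print(hash("hello")) # a large, run-dependent number - string hashing does real work and is randomized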

https://wiki.python.org/moin/TimeComplexity suggests that it's O(cardinality of set A) in the example you described.
My understanding is that it's O(len(A)) and not O(len(B)) because you only need to check whether each element of set A is present in set B. Each lookup in set B is O(1), so you are doing len(A) * O(1) lookups on set B. Since O(1) is constant, the total is O(len(A)).
Eg:
A = {1,2,3,4,5}
B = {3,5,7,8,9,10,11,12,13}
A-B = {1,2,4}
When A-B is called, iterate through every element in A (only 5 elements), and check for membership in B. If not found in B, then it will be present in the result.
Note: Of course this is average-case complexity. In practice, an individual lookup in set B could cost more than O(1).
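To make that concrete, here is a rough pure-Python sketch of what A.difference(B) has to do (the real implementation is in C; set_difference_py is just an illustrative name):
def set_difference_py(a, b):
    # O(len(a)) expected: one average-O(1) hash lookup per element of a
    result = set()
    for item in a:
        if item not in b:
            result.add(item)
    return result

A = {1, 2, 3, 4, 5}
B = {3, 5, 7, 8, 9, 10, 11, 12, 13}
print(set_difference_py(A, B)) # {1, 2, 4}
print(A - B)                   # {1, 2, 4}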

Related

Complexity of built-in methods like list.count(). Do they take O(1) time?

I have two methods to count occurrences of an element: one uses the built-in count method, the other uses a loop.
The time complexity of the second method is O(n), but I'm not sure about the built-in method.
Does count take O(1) or O(n) time? Please also tell me about other built-in methods like reverse, index, etc.
Using count:
List1 = [10,4,5,10,6,4,10]
print(List1.count(10))
Using a loop:
List2 = [10,4,5,10,6,4,10]
count = 0
for ele in List2:
    if (ele == 10):
        count += 1
print(count)
As per the documentation
list.count(x) - Return the number of times x appears in the list.
Now think about it: if you have 10 cups over some coloured balls, can you be 100% certain about the number of red balls under the cups before you check under all of the cups?
Hint: No
Therefore, list.count(x) has to check the entire list. As the list has size n, list.count(x) has to be O(n).
EDIT: For the pedantic readers out there, of course there could be an implementation of lists that stores the count of every item. This would lead to an increase in memory usage but would provide the O(1) for list.count(x).
EDIT2: You can have a look at the implementation of list.count here. You will see a for loop that runs exactly n times, which settles the question: built-in methods do not necessarily take O(1) time, and list.count(x) is an example of a built-in method that is O(n).
The built-in count() method in Python also has a time complexity of O(n).
The time complexity of the count(value) method is O(n) for a list with n elements. The standard Python implementation, CPython, "touches" every element of the list to check whether it is equal to the value, so the time complexity is linear in the number of list elements.
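That C loop is roughly equivalent to this pure-Python sketch (count_py is my own name for it):
def count_py(lst, value):
    # Touches every one of the n elements exactly once -> O(n)
    c = 0
    for item in lst:
        if item == value:
            c += 1
    return c

print(count_py([10, 4, 5, 10, 6, 4, 10], 10)) # 3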
It's an easy thing to see for yourself.
>>> import timeit
>>> timeit.timeit('x.count(10)', 'x=list(range(100))', number=1000)
0.007884800899773836
>>> timeit.timeit('x.count(10)', 'x=list(range(1000))', number=1000)
0.03378760418854654
>>> timeit.timeit('x.count(10)', 'x=list(range(10000))', number=1000)
0.2234031839761883
>>> timeit.timeit('x.count(10)', 'x=list(range(100000))', number=1000)
2.1812812101561576
Maybe a little better than O(n), but definitely closer to that than O(1).

Can the Efficiency of this Algorithm be Linear?

My textbook says that the following algorithm has an efficiency of O(n):
list = [5,8,4,5]
def check_for_duplicates(list):
    dups = []
    for i in range(len(list)):
        if list[i] not in dups:
            dups.append(list[i])
        else:
            return True
    return False
But why? I ask because the in operation has an efficiency of O(n) as well (according to this resource). If we take list as an example the program needs to iterate 4 times over the list. But with each iteration, dups keeps growing faster. So for the first iteration over list, dups does not have any elements, but for the second iteration it has one element, for the third two elements and for the fourth three elements. Wouldn't that make 1 + 2 + 3 = 6 extra iterations for the in operation on top of the list iterations? But if this is true then wouldn't this alter the efficiency significantly, as the sum of the extra iterations grows faster with every iteration?
You are correct that the runtime of the code that you've posted here is O(n^2), not O(n), for precisely the reason that you've indicated.
Conceptually, the algorithm you're implementing goes like this:
Maintain a collection of all the items seen so far.
For each item in the list:
    If that item is in the collection, report a duplicate exists.
    Otherwise, add it to the collection.
Report that there are no duplicates.
The reason the code you've posted here is slow is because the cost of checking whether a duplicate exists is O(n) when using a list to track the items seen so far. In fact, if you're using a list of the existing elements, what you're doing is essentially equivalent to just checking the previous elements of the array to see if any of them are equal!
You can speed this up by switching your implementation so that you use a set to track prior elements rather than a list. Sets have (expected) O(1) lookups and insertions, so this will make your code run in (expected) O(n) time.
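For example, a set-based version of the function from the question might look like this (a sketch of that approach, not code from the textbook):
def check_for_duplicates(items):
    seen = set()
    for item in items:
        if item in seen:   # expected O(1) membership test
            return True
        seen.add(item)     # expected O(1) insertion
    return False

print(check_for_duplicates([5, 8, 4, 5])) # True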

Python - time complexity of not in set

I know the time complexity of checking if x in set is O(1) but what about if x not in set? Would that be O(1) still because set is similar to a dictionary?
x not in some_set just negates the result of x in some_set, so it has the same time complexity. This is the case for any object, set or not. You can take a look at the place where the CPython implementation does res = !res; if you want.
For more information on the time complexities of Python Data Structures please reference this https://wiki.python.org/moin/TimeComplexity.
From this it is shown that x in s is O(1) on average and O(n) in the worst case. So, as pointed out by user2357112, x not in s is equivalent to not (x in s), which just negates the result of x in s and therefore has the same time complexity.
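If you want to see the single shared lookup for yourself, here is a small toy example (NoisySet is my own illustration): both expressions go through __contains__ exactly once.
class NoisySet(set):
    def __contains__(self, item):
        print("__contains__ called with", item)
        return super().__contains__(item)

s = NoisySet({1, 2, 3})
print(5 in s)     # one __contains__ call, prints False
print(5 not in s) # one __contains__ call, result negated, prints True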

Python heapq vs sorted speed for pre-sorted lists

I have a reasonably large number n=10000 of sorted lists of length k=100 each. Since merging two sorted lists takes linear time, I would imagine it's cheaper to recursively merge the sorted lists (total length nk) with heapq.merge() in a tree of depth log(n) than to sort the entire thing at once with sorted() in O(nk log(nk)) time.
However, the sorted() approach seems to be 17-44x faster on my machine. Is the implementation of sorted() that much faster than heapq.merge() that it outstrips the asymptotic time advantage of the classic merge?
import itertools
import heapq
data = [range(n*8000,n*8000+10000,100) for n in range(10000)]
# Approach 1
for val in heapq.merge(*data):
    test = val
# Approach 2
for val in sorted(itertools.chain(*data)):
    test = val
CPython's list.sort() uses an adaptive merge sort, which identifies natural runs in the input, and then merges them "intelligently". It's very effective at exploiting many kinds of pre-existing order. For example, try sorting range(N)*2 (in Python 2) for increasing values of N, and you'll find the time needed grows linearly in N.
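Here is a Python 3 version of that experiment (my own snippet); because the sort detects the two pre-sorted runs and does a single merge, the time should grow roughly linearly in N:
import timeit

for n in (10**5, 10**6, 10**7):
    # list(range(n)) * 2 is two ascending runs back to back
    t = timeit.timeit("sorted(data)", setup=f"data = list(range({n})) * 2", number=10)
    print(n, round(t, 3))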
So the only real advantage of heapq.merge() in this application is lower peak memory use if you iterate over the results (instead of materializing an ordered list containing all the results).
In fact, list.sort() is taking more advantage of the structure in your specific data than the heapq.merge() approach. I have some insight into this, because I wrote Python's list.sort() ;-)
(BTW, I see you already accepted an answer, and that's fine by me - it's a good answer. I just wanted to give a bit more info.)
ABOUT THAT "more advantage"
As discussed a bit in comments, list.sort() plays lots of engineering tricks that may cut the number of comparisons needed over what heapq.merge() needs. It depends on the data. Here's a quick account of what happens for the specific data in your question. First define a class that counts the number of comparisons performed (note that I'm using Python 3, so have to account for all possible comparisons):
class V(object):
    def __init__(self, val):
        self.val = val

    def __lt__(a, b):
        global ncmp
        ncmp += 1
        return a.val < b.val

    def __eq__(a, b):
        global ncmp
        ncmp += 1
        return a.val == b.val

    def __le__(a, b):
        raise ValueError("unexpected comparison")

    __ne__ = __gt__ = __ge__ = __le__
sort() was deliberately written to use only < (__lt__). It's more of an accident in heapq (and, as I recall, even varies across Python versions), but it turns out .merge() only required < and ==. So those are the only comparisons the class defines in a useful way.
Then changing your data to use instances of that class:
data = [[V(i) for i in range(n*8000,n*8000+10000,100)]
        for n in range(10000)]
Then run both methods:
ncmp = 0
for val in heapq.merge(*data):
    test = val
print(format(ncmp, ","))

ncmp = 0
for val in sorted(itertools.chain(*data)):
    test = val
print(format(ncmp, ","))
The output is kinda remarkable:
43,207,638
1,639,884
So sorted() required far fewer comparisons than merge(), for this specific data. And that's the primary reason it's much faster.
LONG STORY SHORT
Those comparison counts looked too remarkable to me ;-) The count for heapq.merge() looked about twice as large as I thought reasonable.
Took a while to track this down. In short, it's an artifact of the way heapq.merge() is implemented: it maintains a heap of 3-element list objects, each containing the current next value from an iterable, the 0-based index of that iterable among all the iterables (to break comparison ties), and that iterable's __next__ method. The heapq functions all compare these little lists (instead of just the iterables' values), and list comparison always goes thru the lists first looking for the first corresponding items that are not ==.
So, e.g., asking whether [0] < [1] first asks whether 0 == 1. It's not, so then it goes on to ask whether 0 < 1.
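You can watch that happen with the V class defined above (my own addition to the answer's snippet):
ncmp = 0
print([V(0)] < [V(1)]) # True
print(ncmp)            # 2: list comparison called V.__eq__ once, then V.__lt__ once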
Because of this, each < comparison done during the execution of heapq.merge() actually does two object comparisons (one ==, the other <). The == comparisons are "wasted" work, in the sense that they're not logically necessary to solve the problem - they're just "an optimization" (which happens not to pay in this context!) used internally by list comparison.
So in some sense it would be fairer to cut the report of heapq.merge() comparisons in half. But it's still way more than sorted() needed, so I'll let it drop now ;-)
sorted uses an adaptive mergesort that detects sorted runs and merges them efficiently, so it gets to take advantage of all the same structure in the input that heapq.merge gets to use. Also, sorted has a really nice C implementation with a lot more optimization effort put into it than heapq.merge.

Complexity of enumerate

I see a lot of questions about the run-time complexity of python's built in methods, and there are a lot of answers for a lot of the methods (e.g. https://wiki.python.org/moin/TimeComplexity , https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt , Cost of len() function , etc.)
What I don't see is anything that addresses enumerate. I know it returns at least one new array (the indexes), but how long does it take to generate that, and is the other array just the original array?
In other words, I'm assuming it's O(n) for creating a new array (iteration) and O(1) for the reuse of the original array... O(n) in total (I think). Is there another O(n) for the copy, making it O(n^2), or something else...?
The enumerate function returns an iterator. The concept of an iterator is described here.
Basically this means that the iterator gets initialized pointing to the first item of the list, and then returns the next element of the list every time its next() method gets called.
So the complexity should be:
Initialization: O(1)
Returning the next element: O(1)
Returning all elements: n * O(1)
Please note that enumerate does NOT create a new data structure (list of tuples or something like that)! It is just iterating over the existing list, keeping the element index in mind.
You can try this out yourself:
# First, create a list containing a lot of entries:
# (O(n) - executing this line should take some noticeable time)
a = [str(i) for i in range(10000000)] # a = ["0", "1", ..., "9999999"]
# Then call enumerate on a.
# (O(1) - executes very fast because that's just the initialization of the iterator.)
b = enumerate(a)
# Use the iterator.
# (O(n) - retrieving the next element is O(1) and there are n elements in the list.)
for i in b:
    pass # do nothing
Assuming the naïve approach (enumerate duplicates the array, then iterates over it), you have O(n) time for duplicating the array, then O(n) time for iterating over it. If that was just n instead of O(n), you would have 2 * n time total, but that's not how O(n) works; all you know is that the amount of time it takes will be some multiple of n. That's (basically) what O(n) means anyway, so in any case, the enumerate function is O(n) time total.
As martineau pointed out, enumerate() does not make a copy of the array. Instead it returns an object which you use to iterate over the array. The call to enumerate() itself is O(1).
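A small demonstration of that (my own example): enumerate yields (index, value) pairs lazily rather than building anything up front.
a = ["a", "b", "c"]
e = enumerate(a) # O(1): just creates an iterator object, no copy of a
print(next(e))   # (0, 'a')
print(next(e))   # (1, 'b')
print(list(e))   # [(2, 'c')] - a list is only built if you explicitly ask for one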
