In Python, what are the running time and space complexities if a list is converted to a set?
Example:
data = [1, 2, 3, 4, 5, 5, 5, 5, 6]
# this converts the list to a set and rebinds the name
data = set(data)
print(data)
# output: {1, 2, 3, 4, 5, 6}
Converting a list to a set requires visiting every item in the list once, which is O(n). Inserting an element into a set is O(1) on average, so the overall time complexity is O(n).
The space required for the new set is at most proportional to the length of the list, so that is also O(n).
Here's a good reference for Python data structures.
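As a rough check (my own sketch, not part of the answer above), a small timing loop should show the per-conversion cost growing roughly linearly with the input size; exact numbers depend on your machine and Python build:
import timeit

# A minimal sketch: time the set(list) conversion for growing input sizes.
# The per-element cost should stay roughly constant, consistent with O(n).
for n in (10_000, 100_000, 1_000_000):
    data = list(range(n))
    t = timeit.timeit(lambda: set(data), number=10)
    print(f"n={n:>9}: {t / 10:.4f} s per conversion")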
You have to iterate through the entire list, which is O(n) time, and then insert each into a set, which is O(1) time. So the overall time complexity is O(n), where n is the length of the list.
No additional space is needed beyond the set being created and the list already in use.
As others have stated regarding the runtime, set creation is O(N) for the entire list, and a set membership check is O(1) per item.
But their comments on memory usage being the same between lists and sets are incorrect.
In Python, sets can use 3x to 10x more memory than lists. Set memory usage still grows as O(N), but the constant factor is always at least about 3x that of a list, perhaps because it needs to keep all those hashes (and spare hash-table slots) in memory.
related: https://stackoverflow.com/a/54891295/1163355
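A quick, hedged way to see this yourself (my own sketch, not from the linked answer; exact figures vary by Python version and platform) is to compare sys.getsizeof for a list and a set built from the same elements:
import sys

data = list(range(100_000))
as_set = set(data)

# Both containers grow O(N), but the set's constant factor is noticeably larger.
print("list:", sys.getsizeof(data), "bytes")
print("set: ", sys.getsizeof(as_set), "bytes")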
I have a set in Python and I want to sample one element from it, as with the random.sample() method. The problem is that sample() converts the set to a tuple internally, which is O(n), and I need to do this as efficiently as possible.
Is there a function I can use to sample an element from a set in O(1) time, or is the only way to do this to write my own set implementation?
Because a hash table's data layout is irregular, it's impossible to sample uniformly from a hash-based set in O(1), except by preprocessing it into some sort of array, which only pays off if you make ω(n) queries. (Such an array could of course be maintained while building the set, but that isn't the starting point given here, and maintaining it is no faster than the tuple conversion.)
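If the set is not changing and you need many samples, one hedged workaround (my own sketch, not from the answer above) is to pay the O(n) conversion once and then draw each sample in O(1) from the resulting tuple:
import random

s = {3, 1, 4, 1, 5, 9, 2, 6}

# Pay the O(n) conversion once...
frozen = tuple(s)

# ...then each individual sample is O(1).
for _ in range(5):
    print(random.choice(frozen))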
I've the following code and I'm trying to get the time complexity.
seen = set()
a = [4, 4, 4, 3, 3, 2, 1, 1, 1, 5, 5]
result = []
for item in a:
    if item not in seen:
        seen.add(item)
        result.append(item)
print(result)
As far as I understand, iterating over the list is O(n). In the if block I also do a lookup in the set each time, and I assumed that costs another O(n). So is the overall time complexity O(n^2)? Does set.add() also add to the complexity?
Also, is the space complexity O(n), since the set grows each time a new element is encountered?
Any input or links to get a proper insight into time and space complexities is appreciated.
Sets in Python are implemented as hash tables (similar to dictionaries), so both the in membership test and set.add() are O(1) on average. list.append() is also amortized O(1).
Altogether, that means the time complexity is O(n), due to the iteration over a.
Space complexity is O(n) as well, because the maximum space required is proportional to the size of the input.
A useful reference for the time complexity of various operations on Python collections can be found at https://wiki.python.org/moin/TimeComplexity, and the PyCon talk The Mighty Dictionary gives an interesting look at how Python achieves O(1) complexity for various set and dict operations.
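To see the difference a hash-based lookup makes in practice, here is a small comparison (my own sketch, not from the answer above) of membership tests against a list versus a set; the list's in scans linearly while the set's is effectively constant time:
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# Searching for an element near the end of the list highlights the gap:
# the list scan is O(n), the set lookup is O(1) on average.
print("list:", timeit.timeit(lambda: n - 1 in as_list, number=100))
print("set: ", timeit.timeit(lambda: n - 1 in as_set, number=100))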
I know that bisect uses binary search to keep lists sorted. However, I ran a timing test comparing inserting values with bisect against collecting the values and sorting them afterwards. Contrary to what I expected, collecting and then sorting wins by a large margin. Could more experienced users please explain this behaviour? Here is the code I used to test the timings.
import timeit

setup = """
import random
import bisect
a = list(range(100000))
random.shuffle(a)
"""

p1 = """
b = []
for i in a:
    b.append(i)
b.sort()
"""

p2 = """
b = []
for i in a:
    bisect.insort(b, i)
"""

print(timeit.timeit(p1, setup=setup, number=1))
print(timeit.timeit(p2, setup=setup, number=1))
# 0.0593081859178
# 1.69218442959
# Huge difference! Roughly 28x faster.
In the first version I take the values one by one instead of just sorting a, to mimic reading them from a file. And it still beats bisect by a wide margin.
Sorting a list takes about O(N*log(N)) time. Appending N items to a list takes O(N) time. Doing these things consecutively takes about O(N*log(N)) time.
Bisecting a list takes O(log(N)) time. Inserting an item into a list takes O(N) time. Doing both N times inside a for loop takes O(N * (N + log(N))) == O(N^2) time.
O(N^2) is worse than O(N*log(N)), so your p1 is faster than your p2.
Your algorithmic complexity will be worse in the bisect case ...
In the bisect case, you have N operations (each at an average cost of log(N) to find the insertion point and then an additional O(N) step to insert the item). Total complexity: O(N^2).
With sort, you have a single O(N*log(N)) step (plus N amortized O(1) appends to build the list in the first place). Total complexity: O(N*log(N)).
Also note that sort is implemented in very heavily optimized C code (bisect isn't quite as optimized since it ends up calling various comparison functions much more frequently...)
To understand the time difference, let’s look at what you are actually doing there.
In your first example, you are taking an empty list, and append items to it, and sorting it in the end.
Appending to lists is really cheap: it has an amortized time complexity of O(1). It cannot be truly constant time because the underlying data structure, a simple array, eventually needs to be expanded as the list grows. That happens every so often, causing a new array to be allocated and the data to be copied, which is a bit more expensive. But amortized over all the appends, we still call this O(1).
Next up comes the sorting. Python uses Timsort, which is very efficient: O(n log n) in both the average and worst case. So overall we have O(n) worth of appends followed by a single O(n log n) sort, and the sort is the only term that matters. In total, this is pretty simple and very fast.
The second example uses bisect.insort. This utilizes a list and binary search to ensure that the list is sorted at all times.
Essentially, on every insert, it uses binary search to find the correct location for the new value, and then shifts the subsequent items to make room at that index. The binary search is cheap, O(log n), so that is not a problem. Shifting on its own is also manageable: in the worst case, we need to move all items one index to the right, so we get O(n) (this is basically the insert operation on lists).
So in total, we would get linear time at worst. However, we do this on every single iteration. So when inserting n elements, we have O(n) each time. This results in a quadratic complexity, O(n²). This is a problem, and will ultimately slow the whole thing down.
So what does this tell us? Inserting into a list one element at a time just to end up with a sorted result is not performant. The bisect module is useful for keeping an already sorted list ordered when we only perform a few insertions, but when we start from a pile of unsorted data, it's faster to sort the data as a whole.
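As a hedged illustration of that last point (my own sketch, not from the answer above), bisect.insort is the right tool when a list is already sorted and only occasionally receives new items:
import bisect

# An already-sorted list that only receives occasional new items.
scores = [10, 25, 40, 70, 90]

# Each insort is O(log n) to locate the position plus O(n) to shift,
# which is perfectly fine for a handful of insertions.
for new_score in (33, 85):
    bisect.insort(scores, new_score)

print(scores)  # [10, 25, 33, 40, 70, 85, 90]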
Insertion and deletion operations in a data structure can sometimes be surprisingly expensive, particularly if the distribution of incoming data values is random, whereas sorting can be unexpectedly fast.
A key consideration is whether-or-not you can "accumulate all the values," then sort them once, then use the sorted result "all at once." If you can, then sorting is almost always very-noticeably faster.
If you remember the old sci-fi movies (back when computers were called "giant brains" and a movie always had spinning tape-drives), that's the sort of processing that they were supposedly doing: applying sorted updates to also-sorted master tapes, to produce a new still-sorted master. Random-access was not needed. (Which was a good thing, because at that time we really couldn't do it.) It is still an efficient way to process vast amounts of data.
I notice that when using sys.getsizeof() to check the size of list and dictionary, something interesting happens.
i have:
a = [1,2,3,4,5]
with a size of 56 bytes (an empty list has a size of 36, so that makes sense: the 20 extra bytes over 5 elements come to 4 bytes per element).
However, after I remove all the items from the list (using .remove or del), the size is still 56. This seems strange to me. Shouldn't the size go back to 36?
Any explanation?
The list doesn't promise to release memory when you remove elements. Lists are over-allocated, which is how they can have amortized O(1) performance for appending elements.
Details of the time performance of the data structures: http://wiki.python.org/moin/TimeComplexity
Increasing the size of a container can be an expensive operation, since it may require that a lot of things be moved around in memory. So Python almost always allocates more memory than is needed for the current contents of a list, allowing any individual addition to the list to have a very good chance of being performed without needing to move memory. For similar reasons, a list may not release the memory for deleted elements immediately, or ever.
However, if you delete all the elements at once using a slice assignment:
a[:] = []
that seems to reset it. This is an implementation detail, however.
When you append an item to a Python list and the memory already allocated for the list is full, CPython allocates a larger block. When you remove an item from a list, it keeps that memory allocated for the next time you append items to the list. See this related post for an example.
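A small demonstration of this over-allocation (my own sketch; the exact byte counts vary across CPython versions and platforms):
import sys

a = []
last = sys.getsizeof(a)
print(0, last)

# The reported size grows in occasional jumps rather than on every append,
# because CPython over-allocates to keep append amortized O(1).
for i in range(1, 33):
    a.append(i)
    size = sys.getsizeof(a)
    if size != last:
        print(i, size)
        last = size

# Clearing with a slice assignment appears to drop back to the empty-list
# size, but that is an implementation detail.
a[:] = []
print(sys.getsizeof(a))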
I have just found these performance notes for CPython lists:
Time needed for python lists to ....
... get or set an individual item: O(1)
... append an item to the list: worst O(n^2), but usually O(1)
... insert an item: O(n), where n is the number of elements after the inserted one
... remove an item: O(n)
Now I would like to know the same performance characteristics for CPython sets. I would also like to know how fast iteration over a list / set is. I am especially interested in large lists / sets.
AFAIK, the Python "specification" does not impose specific data structures for the implementation of lists, dictionaries or sets, so this can't be answered "officially". If you're only concerned with CPython (the reference implementation), then we can throw in some unofficial complexities. You might want to reformulate your question to target a specific Python implementation.
In any case, the complexities you mentioned can't be right. Assuming a dynamically-resized array implementation, appending an item is amortized O(1): most often you simply copy the new value, and in the worst case you need to reallocate, copying all n items plus the new one. Inserting has exactly the same worst case, so it has the same upper bound on complexity, but when no reallocation is needed it only moves k items, where k is the number of items past the insertion position.
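To make the amortized-O(1) append versus O(n) insert distinction concrete, here is a micro-benchmark sketch (my own; timings on your machine will differ) contrasting appends at the end with inserts at the front, which must shift every existing element:
import timeit

def append_n(n):
    b = []
    for i in range(n):
        b.append(i)          # amortized O(1) per append

def insert_front_n(n):
    b = []
    for i in range(n):
        b.insert(0, i)       # O(n) per insert: shifts every existing item

n = 50_000
print("append:      ", timeit.timeit(lambda: append_n(n), number=1))
print("insert front:", timeit.timeit(lambda: insert_front_n(n), number=1))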