I have just found these performance notes for CPython lists:
Time needed for Python lists to ...
... get or set an individual item: O(1)
... append an item to the list: worst O(n^2), but usually O(1)
... insert an item: O(n), where n is the number of elements after the inserted one
... remove an item: O(n)
Now I would like to know the same performance characteristics for CPython sets. Also, I would like to know how fast iteration over a list / set is. I am especially interested in large lists / sets.
AFAIK, the Python "specification" does not impose specific data structures for implementation of lists, dictionaries or sets, so this can't be answered "officially". If you're only concerned about CPython (the reference implementation), then we can throw in some un-official complexities. You might want to re-formulate your question to target a specific Python implementation.
In any case, the complexities you mentioned can't be right. Supposing a dynamically resized array implementation, appending an item is amortized O(1): most often you simply copy the new value into place, and in the worst case you need to reallocate, copying all n items plus the new one. Inserting has exactly the same worst-case scenario, so it has the same upper bound on complexity, but in the best case it only moves k items, where k is the number of items past the position where you're inserting.
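If you want to see the difference yourself, a rough timeit sketch like the one below works. The sizes are arbitrary and the absolute numbers will vary by machine, but append should grow roughly linearly with n while insert(0, ...) grows quadratically.

import timeit

# Compare amortized O(1) appends with O(n) insertions at the front.
for n in (1_000, 10_000, 50_000):
    t_append = timeit.timeit("lst.append(0)", setup="lst = []", number=n)
    t_insert = timeit.timeit("lst.insert(0, 0)", setup="lst = []", number=n)
    print(f"n={n:>6}  append: {t_append:.4f}s  insert(0, ...): {t_insert:.4f}s")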
We have all been told that the popular theoretical lower bound for general-case sorting of objects is O(n*log(n)), in many languages.
Let's say we have a list:
lst = [1,1,2,3,4,5,3,2,3,4,2,1,2,3]
In Python, I was recently introduced to some additional benefits of using Counter (from collections import Counter) over a plain dictionary with the list's values as keys and their occurrence counts as values.
coun = Counter(lst)
print(coun) # ==> Counter({2: 4, 3: 4, 1: 3, 4: 2, 5: 1})
It was shown several times (What is the time complexity of collections.Counter() in Python?) that constructing a Counter takes O(n) and that, unlike a plain dict, Counter() carries some additional space overhead to store the frequency of each element.
When you work with a Counter, it often appears to return its output in sorted order, e.g. from .items() or .keys(). Maybe, for the sake of convenience, it applies a quick O(n log n) sort before handing you the result, but that sounds unexpectedly bad when you use it in a simple traversal:
for i in range(len(lst)):
    if lst[i] not in coun.keys():
        print("element", lst[i], "not found!")
You would naturally expect the complexity of the above to be O(n), as with a standard dictionary (an O(1) presence check repeated over n loop iterations).
So without peeking into the code, let's just assume that lst[i] not in coun.keys() is implemented with O(1) complexity, using some space overhead.
Is it theoretically possible that, during Counter construction, this additional space overhead (potentially prohibitively large for really big lists with many unique values) gives us an edge on small and medium-sized lists (length < 1000), i.e. an O(n) sorting advantage at the cost of using extra space?
If the above is possible, I assume that behind the scenes there is a mechanism that stops counting every single element and placing it into the correct sorted position once the memory footprint exceeds some defined value (like 1 MB), at which point lst[i] not in coun.keys() becomes O(log n).
Just thinking out loud here, as in reality a lot of the lists we work with actually have fewer than 1000 elements.
Afterthought 1:
On the other hand, you probably wouldn't care much about O(n) vs O(n log n) when n < 1000; the time gain would be barely noticeable, at a potentially huge price in space overhead.
Afterthought 2:
It appears that .keys() preserves insertion order, which just happened to coincide with sorted order because of my poor choice of initial data set.
Nevertheless, is it possible to have an implementation of the data structure that places the counted objects in their correct sorted positions at the moment they are added?
The O(n*log(n)) lower bound on sorting algorithms only applies to algorithms that can sort arbitrary objects by comparing them to one another. If you know that your data is from a limited domain, you can use more efficient algorithms. For example, if the values are all small integers you can use a counting sort to efficiently sort the data in O(n) time.
Here's an example that can sort sequences that only contain integers from the domain 0-5, like in your example.
def sort_0_to_5(data):
    counts = [0, 0, 0, 0, 0, 0]
    for val in data:
        counts[val] += 1
    return [val for val in range(len(counts)) for _ in range(counts[val])]
This runs in O(n) time and uses only constant auxiliary space (beyond the output list). This is a very basic counting sort; a fancier version could sort arbitrary objects as long as they have integer keys within the domain. (You just need a couple of extra passes over the data to compute cumulative counts and then build the output in the right order; see the sketch below.)
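For illustration, here is roughly what such a key-based version might look like. This is just a sketch with my own naming (counting_sort_by_key), assuming key(x) returns an integer in range(k):

def counting_sort_by_key(data, key, k):
    # Stable counting sort by key(x), assuming every key is in range(k).
    # Runs in O(n + k) time.
    counts = [0] * k
    for item in data:
        counts[key(item)] += 1
    # Turn counts into starting positions (cumulative counts).
    total = 0
    for i in range(k):
        counts[i], total = total, total + counts[i]
    out = [None] * len(data)
    for item in data:                # second pass builds the output in order
        out[counts[key(item)]] = item
        counts[key(item)] += 1
    return out

print(counting_sort_by_key(["bb", "a", "dd", "ccc"], key=len, k=4))
# ['a', 'bb', 'dd', 'ccc'] -- stable: 'bb' stays ahead of 'dd'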
More sophisticated algorithms like radix sort can handle much larger domains in quasi-linear time. Accounting for time gets tricky, though: once the domain becomes comparable to the size of the data set, the parts of the code that deal with the domain size stop being effectively constant. Radix sort, for example, takes O(n*log(k)) time, where k is the size of the domain.
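As a sketch of the idea (my own simplified LSD radix sort for non-negative integers; one stable bucketing pass per digit):

def radix_sort(nums, base=10):
    # LSD radix sort: O(n * log_base(k)) where k is the largest value.
    if not nums:
        return []
    out = list(nums)
    max_val = max(nums)
    exp = 1
    while exp <= max_val:
        buckets = [[] for _ in range(base)]
        for x in out:
            buckets[(x // exp) % base].append(x)   # bucket by the current digit
        out = [x for bucket in buckets for x in bucket]
        exp *= base
    return out

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]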
I'd note, however, that even if you can come up with a sorting algorithm that has a better time complexity than the standard comparison sorts, that may not actually make it faster on your real data. Unless your data set is huge, the constant factors that asymptotic analysis excludes are likely to matter quite a lot. You may find that a very well implemented O(n*log(n)) sort (like the one behind Python's sorted) performs better than an O(n) sort you've coded up by hand.
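If you want to check the constant-factor point empirically, a rough comparison like the following would do it. It assumes the sort_0_to_5 function from the earlier snippet is already defined; results will vary by machine and Python version.

import random
import timeit

data = [random.randrange(6) for _ in range(10_000)]
# Built-in O(n log n) sort in optimized C vs. the hand-written O(n) counting sort.
print(timeit.timeit(lambda: sorted(data), number=200))
print(timeit.timeit(lambda: sort_0_to_5(data), number=200))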
I'm teaching myself data structures through this Python book and I'd appreciate it if someone could correct me if I'm wrong, since a hash set seems to be extremely similar to a hash map.
Implementation:
A Hashset is a list [] or array where each index points to the head of a linked list
So some hash(some_item) --> key, and then list[key], and then add to the head of a linked list. This occurs in O(1) time
When removing a value from the linked list, in Python we replace it with a placeholder because hashsets are not allowed to have Null/None values, correct?
When the list[] gets over a certain % of load/fullness, we copy it over to another list
Regarding Time Complexity Confusion:
So one question is: why is the average search/access O(1) if there can be a list of N items in the linked list at a given index?
Wouldn't the average case be that the search item is in the middle of its bucket's linked list, so it should be O(n/2) -> O(n)?
Also, when removing an item, if we are replacing it with a placeholder value, isn't this considered a waste of memory if the placeholder is never used?
And finally, what is the difference between this and a HashMap other than HashMaps can have nulls? And HashMaps are key/value while Hashsets are just value?
For your first question - why is the average time complexity of a lookup O(1)? - this statement is in general only true if you have a good hash function. An ideal hash function is one that causes a nice spread on its elements. In particular, hash functions are usually chosen so that the probability that any two elements collide is low. Under this assumption, it's possible to formally prove that the expected number of elements to check is O(1). If you search online for "universal family of hash functions," you'll probably find some good proofs of this result.
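To make the chaining picture concrete, here is a minimal hashing-with-chaining set (my own sketch, not how CPython's set actually works). With an evenly spreading hash and a bounded load factor the chains stay short, which is where the O(1) expected lookup comes from:

class ChainedHashSet:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def _bucket(self, item):
        # Pick a bucket from the item's hash; a good hash spreads items evenly.
        return self.buckets[hash(item) % len(self.buckets)]

    def add(self, item):
        bucket = self._bucket(item)
        if item not in bucket:
            bucket.append(item)
            self.size += 1
            if self.size > 2 * len(self.buckets):   # keep the load factor bounded
                self._resize()

    def __contains__(self, item):
        # Only the (short) chain in one bucket is scanned, not all items.
        return item in self._bucket(item)

    def _resize(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for item in bucket:
                self._bucket(item).append(item)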
As for using placeholders - there are several different ways to implement a hash table. The approach you're using is called "closed addressing" or "hashing with chaining," and in that approach there's little reason to use placeholders. However, other hashing strategies exist as well. One common family of approaches is called "open addressing" (the most famous of which is linear probing hashing), and in those setups placeholder elements are necessary to avoid false negative lookups. Searching online for more details on this will likely give you a good explanation about why.
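For comparison, here is a bare-bones open-addressing (linear probing) sketch of my own that shows why a placeholder is needed on delete. It never resizes or reuses tombstones, so it is only an illustration:

_EMPTY = object()
_DELETED = object()   # the "placeholder" / tombstone

class ProbingHashSet:
    # Linear-probing sketch with no resizing: keep it well under capacity.
    def __init__(self, capacity=16):
        self.slots = [_EMPTY] * capacity

    def add(self, item):
        i = hash(item) % len(self.slots)
        while self.slots[i] is not _EMPTY:
            if self.slots[i] == item:
                return                          # already present
            i = (i + 1) % len(self.slots)       # probe the next slot
        self.slots[i] = item

    def __contains__(self, item):
        i = hash(item) % len(self.slots)
        while self.slots[i] is not _EMPTY:      # a tombstone does NOT stop the probe
            if self.slots[i] == item:
                return True
            i = (i + 1) % len(self.slots)
        return False

    def discard(self, item):
        i = hash(item) % len(self.slots)
        while self.slots[i] is not _EMPTY:
            if self.slots[i] == item:
                # Writing _EMPTY here would cut probe chains short and cause
                # false negatives for items stored past this slot.
                self.slots[i] = _DELETED
                return
            i = (i + 1) % len(self.slots)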
As for how this differs from HashMap, the HashMap is just one possible implementation of a map abstraction backed by a hash table. Java's HashMap does support nulls, while other approaches don't.
The lookup time wouldn't be O(n), because not all items need to be searched; it also depends on the number of buckets. More buckets decrease the probability of a collision and reduce the chain length.
The number of buckets can be kept as a constant factor of the number of entries by resizing the hash table as needed. Along with a hash function that evenly distributes the values, this keeps the expected chain length bounded, giving constant time lookups.
The hash tables used by hashmaps and hashsets are the same except they store different values. A hashset will contain references to a single value, and a hashmap will contain references to a key and a value. Hashsets can be implemented by delegating to a hashmap where the keys and values are the same.
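That delegation can be sketched in a few lines (the names are mine; Python's built-in set does not literally work this way, but the idea is the same):

class DictBackedSet:
    def __init__(self, items=()):
        self._d = {x: x for x in items}     # key and value are the same object

    def add(self, item):
        self._d[item] = item

    def discard(self, item):
        self._d.pop(item, None)

    def __contains__(self, item):
        return item in self._d

    def __len__(self):
        return len(self._d)

    def __iter__(self):
        return iter(self._d)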
A lot has been written here about open hash tables, but some fundamental points are missed.
Practical implementations generally have O(1) lookup and delete because they guarantee buckets won't contain more than a fixed number of items (the load factor). But this means they can only achieve amortized O(1) time for insert because the table needs to be reorganized periodically as it grows.
(Some may opt to reorganize on delete as well, to shrink the table when the load factor reaches some bottom threshold, but this only affects space, not asymptotic run time.)
Reorganization means increasing (or decreasing) the number of buckets and re-assigning all elements into their new bucket locations. There are schemes, e.g. extensible hashing, to make this a bit cheaper. But in general it means touching each element in the table.
Reorganization, then, is O(n). How can insert be O(1) when any given one may incur this cost? The secret is amortization and the power of powers. When the table is grown, it must be grown by a factor greater than one, two being most common. If the table starts with 1 bucket and doubles each time the load factor reaches F, then the cost of N reorganizations is
F + 2F + 4F + 8F + ... + (2^(N-1))F = (2^N - 1)F
At this point the table contains (2^(N-1))F elements, the number in the table during the last reorganization. I.e. we have done (2^(N-1))F inserts, and the total cost of reorganization is as shown on the right. The interesting part is the average cost per element in the table (or insert, take your pick):
(2^N - 1)F / ((2^(N-1))F) ~= 2^N / 2^(N-1) = 2
That's where the amortized O(1) comes from.
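You can check the arithmetic with a tiny simulation (a sketch: it counts element moves for a doubling table and measures right after a doubling, which is the worst point):

def moves_per_insert(n_inserts):
    buckets, size, moves = 1, 0, 0
    for _ in range(n_inserts):
        size += 1
        if size > buckets:     # load factor 1: the table is full, reorganize
            buckets *= 2
            moves += size      # every element gets re-bucketed
    return moves / n_inserts

for k in (10, 16, 22):
    n = 2**k + 1               # just after a doubling: the worst case
    print(n, round(moves_per_insert(n), 3))
# Each ratio comes out close to 2 regardless of n -- amortized O(1) per insert.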
One additional point is that for modern processors, linked lists aren't a great idea for the bucket lists. With 8-byte pointers, the overhead is meaningful. More importantly, heap-allocated nodes in a single list will almost never be contiguous in memory. Traversing such a list kills cache performance, which can slow things down by orders of magnitude.
Arrays (with an integer count for number of data-containing elements) are likely to work out better. If the load factor is small enough, just allocate an array equal in size to the load factor at the time the first element is inserted in the bucket. Otherwise, grow these element arrays by factors the same way as the bucket array! Everything will still amortize to O(1).
To delete an item from such a bucket, don't mark it deleted. Just copy the last array element to the location of the deleted one and decrement the element count. Of course this won't work if you allow external pointers into the hash buckets, but that's a bad idea anyway.
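In Python terms, that swap-with-last deletion from an array bucket looks like this (a sketch; bucket is just a plain list standing in for the bucket array):

def bucket_remove(bucket, index):
    bucket[index] = bucket[-1]   # overwrite the removed slot with the last item
    bucket.pop()                 # drop the now-duplicated last item: O(1)

b = [10, 20, 30, 40]
bucket_remove(b, 1)              # remove 20
print(b)                         # [10, 40, 30] -- order changes, no placeholder left behind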
I know that bisect uses binary search to keep lists sorted. However, I ran a timing test comparing inserting values as they are read (with bisect.insort) against simply appending them and sorting at the end. Contrary to what I expected, appending and then sorting wins by a large margin. Could more experienced users please explain this behavior? Here is the code I use to test the timings.
import timeit

setup = """
import random
import bisect
a = list(range(100000))
random.shuffle(a)
"""

p1 = """
b = []
for i in a:
    b.append(i)
b.sort()
"""

p2 = """
b = []
for i in a:
    bisect.insort(b, i)
"""

print(timeit.timeit(p1, setup=setup, number=1))
print(timeit.timeit(p2, setup=setup, number=1))
# 0.0593081859178
# 1.69218442959
# Huge difference! p1 is roughly 28x faster.
In the first snippet I take the values one by one instead of just sorting a, to mimic reading from a file. And it beats bisect very convincingly.
Sorting a list takes about O(N*log(N)) time. Appending N items to a list takes O(N) time. Doing these things consecutively takes about O(N*log(N)) time.
Bisecting a list takes O(log(N)) time. Inserting an item into a list takes O(N) time. Doing both N times inside a for loop takes O(N * (N + log(N))) == O(N^2) time.
O(N^2) is worse than O(N*log(N)), so your p1 is faster than your p2.
Your algorithmic complexity will be worse in the bisect case ...
In the bisect case, you have N operations (each at an average cost of log(N) to find the insertion point and then an additional O(N) step to insert the item). Total complexity: O(N^2).
With sort, you have a single O(N*log(N)) step (plus N O(1) steps to build the list in the first place). Total complexity: O(N*log(N)).
Also note that sort is implemented in very heavily optimized C code (bisect isn't quite as optimized since it ends up calling various comparison functions much more frequently...)
To understand the time difference, let’s look at what you are actually doing there.
In your first example, you are taking an empty list, and append items to it, and sorting it in the end.
Appending to lists is really cheap, it has an amortized time complexity of O(1). It cannot be really constant time because the underlying data structure, a simple array, eventually needs to be expanded as the list grows. This is done every so often which causes a new array to be allocated and the data being copied. That’s a bit more expensive. But in general, we still say this is O(1).
Next up comes the sorting. Python uses Timsort, which is very efficient: O(n log n) in the average and worst case. So overall, we get n cheap appends followed by a single O(n log n) sort, and the sort is the only thing that matters here. In total, this is pretty simple and very fast.
The second example uses bisect.insort. This utilizes a list and binary search to ensure that the list is sorted at all times.
Essentially, on every insert it uses binary search to find the correct location for the new value, and then shifts all items to the right of that index to make room. Binary search is cheap, O(log n), so this is not a problem. The shifting alone is not a problem either: in the worst case, we need to move all items one index to the right, so we get O(n) (this is basically the insert operation on lists).
So in total, we would get linear time at worst. However, we do this on every single iteration. So when inserting n elements, we have O(n) each time. This results in a quadratic complexity, O(n²). This is a problem, and will ultimately slow the whole thing down.
So what does this tell us? Sorted inserting into a list to get a sorted result is not really performant. We can use the bisect module to keep an already sorted list ordered when we only do a few operations, but when we actually have unsorted data, it’s easier to sort the data as a whole.
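For example, where insort does pay off is a list that is already sorted and only receives the occasional single insertion between lookups (a small sketch with made-up data):

import bisect

scores = [10, 25, 40, 70, 95]          # kept sorted at all times
bisect.insort(scores, 55)              # one O(log n) search plus one O(n) shift
print(scores)                          # [10, 25, 40, 55, 70, 95]
print(bisect.bisect_left(scores, 70))  # O(log n) position lookup: 4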
Insertion and deletion operations in a data structure can be surprisingly expensive sometimes, particularly if the distribution of incoming data values is random. Whereas, sorting can be unexpectedly fast.
A key consideration is whether or not you can "accumulate all the values," then sort them once, then use the sorted result "all at once." If you can, then sorting is almost always very noticeably faster.
If you remember the old sci-fi movies (back when computers were called "giant brains" and a movie always had spinning tape-drives), that's the sort of processing that they were supposedly doing: applying sorted updates to also-sorted master tapes, to produce a new still-sorted master. Random-access was not needed. (Which was a good thing, because at that time we really couldn't do it.) It is still an efficient way to process vast amounts of data.
I am designing a piece of software in Python and I got a little curious about whether there is any time difference between popping items from a dictionary of very small length and popping items from a dictionary of very large length, or whether it is the same in all cases.
You can easily answer this question for yourself using the timeit module. But the entire point of a dictionary is near-instant access to any desired element by key, so I would not expect to have a large difference between the two scenarios.
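For instance, a rough timeit sketch along these lines would do it. The statement pops a key and immediately re-inserts it so the dict keeps its size across iterations; absolute numbers will vary by machine.

import timeit

small_setup = "d = {i: i for i in range(10)}"
large_setup = "d = {i: i for i in range(1_000_000)}"
stmt = "d.pop(5); d[5] = 5"
print(timeit.timeit(stmt, setup=small_setup, number=100_000))
print(timeit.timeit(stmt, setup=large_setup, number=100_000))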
Check out this article on Python TimeComplexity:
The Average Case times listed for dict objects assume that the hash function for the objects is sufficiently robust to make collisions uncommon. The Average Case assumes the keys used in parameters are selected uniformly at random from the set of all keys.
Note that there is a fast-path for dicts that (in practice) only deal with str keys; this doesn't affect the algorithmic complexity, but it can significantly affect the constant factors: how quickly a typical program finishes.
According to this article, for a 'Get Item' operation the average case is O(1), with a worst case of O(n). In other words, the worst case is that the time increases linearly with size. See Big O Notation on Wikipedia for more information.
So I have a list of 85 items. I would like to continually reduce this list in half (essentially a binary search on the items) -- my question is then, what is the most efficient way to reduce the list? A list comprehension would continually create copies of the list which is not ideal. I would like in-place removal of ranges of my list until I am left with one element.
I'm not sure if this is relevant but I'm using collections.deque instead of a standard list. They probably work the same way more or less though so I doubt this matters.
For a mere 85 items, truthfully, almost any method you want to use would be more than fast enough. Don't optimize prematurely.
That said, depending on what you're actually doing, a list may be faster than a deque. A deque is faster for adding and removing items at either end, but it doesn't support slicing.
With a list, if you want to copy or delete a contiguous range of items (say, the first 42) you can do this with a slice. Assuming half the list is eliminated at each pass, copying items to a new list would be slower on average than deleting items from the existing list (deleting requires moving the half of the list that's not being deleted "leftward" in memory, which would be about the same time cost as copying the other half, but you won't always need to do this; deleting the latter half of a list won't need to move anything).
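Concretely, the two list options look like this (a sketch with arbitrary sizes):

x = list(range(100))
del x[:len(x) // 2]        # drop the lower half in place (shifts the kept items left)
print(len(x), x[0])        # 50 50

y = list(range(100))
y = y[:len(y) // 2]        # keep the lower half as a new list (a copy)
print(len(y), y[-1])       # 50 49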
To do this with a deque efficiently, you would want to pop() or popleft() the items rather than slicing them (lots of attribute access and method calls, which are relatively expensive in Python), and you'd have to write the loop that controls the operation in Python, which will be slower than the native slice operation.
Since you said it's basically a binary search, it is probably actually fastest to simply find the item you want to keep without modifying the original container at all, and then return a new container holding that single item. A list is going to be faster for this than a deque since you will be doing a lot of accessing items by index. To do this in a deque will require Python to follow the linked list from the beginning each time you access an item, while accessing an item by index is a simple, fast calculation for a list.
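A sketch of that index-only approach (goes_right is a placeholder for whatever test drives your search; nothing is ever removed from the container):

def narrow_to_one(items, goes_right):
    lo, hi = 0, len(items)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if goes_right(mid):
            lo = mid        # keep the upper half [mid, hi)
        else:
            hi = mid        # keep the lower half [lo, mid)
    return items[lo]

data = list(range(85))
print(narrow_to_one(data, lambda mid: data[mid] <= 57))   # 57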
collections.deque is implemented via a linked list, hence binary search would be much slower than a linear search. Rethink your approach.
Not sure that this is what you really need but:
x = list(range(100))
while len(x) > 1:
    if condition:
        x = x[:len(x) // 2]
    else:
        x = x[len(x) // 2:]
1. 85 items are not even worth thinking about. Computers are fast, really.
2. Why would you delete ranges from the list, instead of simply picking the one result?
3. If there is a good reason why you can't do (2): keep the original list and change two indices only: the start and end index of the sublist you're looking at.
On a previous question I compared a number of techniques for removing items from a list given a predicate. (That is, I have a function which returns True or False for whether to keep a particular item.) As I recall, using a list comprehension was the fastest. The fact is, copying is really, really cheap.
The only thing that you can do to improve the speed depends on which items you are removing. But you haven't indicated anything about that so I can't suggest anything.