Python data structures - sorting big O complexity implementation

We have all been told about the popular theoretical limit of O(n*log(n)) for general-case sorting of objects in many languages.
Let's say we have a list:
lst = [1,1,2,3,4,5,3,2,3,4,2,1,2,3]
In Python, I was recently introduced to some additional benefits of using Counter (from collections import Counter) over a plain dictionary whose keys are the list values and whose values are their occurrence counts.
coun = Counter(lst)
print(coun) # ==> Counter({2: 4, 3: 4, 1: 3, 4: 2, 5: 1})
It has been shown several times (What is the time complexity of collections.Counter() in Python?) that constructing a Counter takes O(n) and that, unlike a standard dict, Counter() carries some additional space overhead to store the frequency of each element.
When you work with a Counter, it often returns output in sorted order: .items() or .keys(). Maybe, for the sake of convenience, it applies a quick O(n*log(n)) sort before giving you the result, but that sounds unexpectedly bad when you use it in a simple traversal:
for i in range(len(lst)):
    if lst[i] not in coun.keys():
        print("element", lst[i], "not found!")
You would naturally expect the complexity of the above to be O(n), as with a standard dictionary (an O(1) membership check inside a loop over n elements).
So, without peeking into the code, let's just assume that lst[i] not in coun.keys() is implemented with O(1) complexity, using some space overhead.
Is it theoretically possible that, during Counter construction, this additional space overhead (potentially prohibitively large for really big lists with many unique values) gives us an edge on small and medium-sized lists (length < 1000), so that we effectively get an O(n) sort at the cost of extra space?
If the above is possible, I would assume there is a mechanism behind the scenes that stops counting every single element and placing it in its correct sorted position once the memory footprint exceeds some defined threshold (say 1 MB), at which point lst[i] not in coun.keys() becomes O(log n).
Just thinking out loud here, since in reality a lot of the lists we work with have fewer than 1000 elements.
Afterthought 1:
On the other hand, you probably wouldn't care much about O(n) vs O(n*log(n)) when n < 1000; the time gain would be barely noticeable, at a potentially huge price in space overhead.
Afterthought 2:
It appears that .keys() preserves insertion order, which just happened to coincide with sorted order because of my poorly chosen initial data set.
Nevertheless, is it possible to have an implementation of a data structure that places the counted objects in their correct sorted positions at the moment they are added?

The O(n*log(n)) lower bound on sorting algorithms only applies to algorithms that can sort arbitrary objects by comparing them to one another. If you know that your data is from a limited domain, you can use more efficient algorithms. For example, if the values are all small integers you can use a counting sort to efficiently sort the data in O(n) time.
Here's an example that can sort sequences that only contain integers from the domain 0-5, like in your example.
def sort_0_to_5(data):
    counts = [0, 0, 0, 0, 0, 0]
    for val in data:
        counts[val] += 1
    return [val for val in range(len(counts)) for _ in range(counts[val])]
This runs in O(n) time and uses only constant space. This is a very basic counting sort, a fancier version could sort arbitrary objects as long as they have integer keys within the domain. (You just need a couple extra passes over the data to make cumulative counts and then to build up the output in the right order.)
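As a rough sketch of that fancier version (the function name counting_sort_by_key and the key/k parameters are illustrative, not from any library), a stable counting sort over integer keys in the range 0..k-1 could look like this:

def counting_sort_by_key(items, key, k):
    # Count how many items map to each key value (keys assumed to be ints in 0..k-1).
    counts = [0] * k
    for item in items:
        counts[key(item)] += 1
    # Extra pass: turn counts into cumulative starting positions.
    total = 0
    for i in range(k):
        counts[i], total = total, total + counts[i]
    # Final pass: place each item at its position; equal keys keep their input order.
    output = [None] * len(items)
    for item in items:
        output[counts[key(item)]] = item
        counts[key(item)] += 1
    return output

# Example: sort the list from the question by the value itself.
print(counting_sort_by_key([1, 1, 2, 3, 4, 5, 3, 2], key=lambda x: x, k=6))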
More sophisticated algorithms like radix sort can handle much larger domains in quasi-linear time. The way you account for time gets tricky, though: once the domain size becomes comparable to the size of the data set, the parts of the code that deal with the domain size are no longer effectively constant. Radix sort, for example, takes O(n*log(k)) time, where k is the size of the domain.
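As a hedged illustration of the idea (a minimal sketch, not an optimised implementation), a least-significant-digit radix sort for non-negative integers might look like the following; the number of passes grows with the number of digits, i.e. roughly log(k):

def radix_sort(values, base=10):
    # LSD radix sort for non-negative integers; each pass is a stable bucketing
    # on one digit, and the number of passes is about log_base(max value).
    if not values:
        return []
    result = list(values)
    max_val = max(result)
    exp = 1
    while max_val // exp > 0:
        buckets = [[] for _ in range(base)]
        for v in result:
            buckets[(v // exp) % base].append(v)
        result = [v for bucket in buckets for v in bucket]
        exp *= base
    return result

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))  # [2, 24, 45, 66, 75, 90, 170, 802]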
I'd note however that even if you can figure out a sorting algorithm that has a better time complexity than the standard comparison sorts, that may not actually mean it is faster for your actual data. Unless the size of your data set is huge, the constant terms that get excluded from asymptotic analysis are likely to matter quite a lot. You may find that a very well implemented O(n*log(n)) sort (like the one behind Python's sorted) performs better than a O(n) sort you've coded up by hand.
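If in doubt, measure. A rough benchmark along these lines, reusing sort_0_to_5 from the snippet above (sizes and repetition counts are arbitrary), will show how the two compare on your data:

import random
import timeit

data = [random.randrange(6) for _ in range(1000)]

# sort_0_to_5 is assumed to be the function defined in the snippet above.
t_builtin = timeit.timeit(lambda: sorted(data), number=1000)
t_counting = timeit.timeit(lambda: sort_0_to_5(data), number=1000)
print("built-in sorted:", t_builtin)
print("hand-written counting sort:", t_counting)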

Related

Does this sentence contradict the Python paradigm "list should not be initialized"?

People coming from other programming languages to Python often ask how they should pre-allocate or initialize their lists. This is especially true for people coming from MATLAB, where code such as
l = [];
for i = 1:100
    l(end+1) = 1;
end
produces a warning that explicitly suggests you preallocate (initialize) the array.
There are several posts on SO explaining (and showing through tests) that list initialization isn't required in python. A good example with a fair bit of discussion is this one (but the list could be very long): Create a list with initial capacity in Python
The other day, however, while looking up the complexity of list operations in Python, I stumbled upon this sentence on the official Python wiki:
the largest [cost for list operations] come from growing beyond the current allocation size (because everything must move),
This seems to suggest that lists do indeed have a pre-allocated size and that growing beyond that size causes the whole list to be moved.
This shook my foundations a bit. Can list pre-allocation reduce the overall complexity (in terms of number of operations) of a piece of code? If not, what does that sentence mean?
EDIT:
Clearly my question regards the (very common) code:
container = ...  # some iterable with a gazillion elements
new_list = []
for x in container:
    ...  # do whatever you want with x
    new_list.append(x)  # or something computed using x
In this case the interpreter cannot know how many items there are in container, so new_list could potentially require its allocated memory to change an incredibly large number of times, if what that sentence says is true.
I know that this is different for list comprehensions.
Can list pre-allocation reduce the overall complexity (in terms of number of operations) of a code?
No, the overall time complexity of the code will be the same, because the time cost of reallocating the list is O(1) when amortised over all of the operations which increase the size of the list.
If not, what does that sentence means?
In principle, pre-allocating the list could reduce the running time by some constant factor, by avoiding multiple re-allocations. This doesn't mean the complexity is lower, but it may mean the code is faster in practice. If in doubt, benchmark or profile the relevant part of your code to compare the two options; in most circumstances it won't matter, and when it does, there are likely to be better alternatives anyway (e.g. NumPy arrays) for achieving the same goal.
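For instance, a deliberately rough comparison of the two styles might look like the sketch below (sizes and repetition counts are arbitrary); the exact numbers will vary by machine, and the difference is typically only a modest constant factor:

import timeit

n = 1_000_000

def grow_by_append():
    new_list = []
    for x in range(n):
        new_list.append(x)
    return new_list

def fill_preallocated():
    # Pre-allocate all n slots up front, then fill them in place.
    new_list = [None] * n
    for i in range(n):
        new_list[i] = i
    return new_list

print("append:       ", timeit.timeit(grow_by_append, number=10))
print("pre-allocated:", timeit.timeit(fill_preallocated, number=10))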
new_list could potentially require its allocated memory to change an incredible number of times
List reallocation follows a geometric progression, so if the final length of the list is n then the list is reallocated only O(log n) times along the way; not an "incredible number of times". The way the maths works out, the average number of times each element gets copied to a new underlying array is a constant regardless of how large the list gets, hence the O(1) amortised cost of appending to the list.
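A minimal way to observe that geometric growth (a CPython implementation detail, so the exact sizes are not guaranteed) is to watch the allocated size of a growing list jump only occasionally:

import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(100):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:
        # A reallocation happened: the capacity grew by more than one slot.
        print(f"len={len(lst):3d}  allocated bytes={size}")
        last_size = size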

Insert and get max from structure in constant time

I need a data structure to store positive (not necessarily integer) values. It must support the following two operations in sublinear time:
Add an element.
Remove the largest element.
Also, the largest key may scale as N^2, N being the number of elements. In principle, an O(N^2) space requirement wouldn't be a big problem, but if a more efficient option exists in terms of storage, that would be better.
I am working in Python, so if such a data structure exists, it would be of help to have an implementation in this language.
There is no such data structure, at least not with both operations in constant time. If there were, sorting would be worst-case linear: add all N elements in O(N) time, then remove the largest remaining element N times, again in O(N) total time, contradicting the comparison-sort lower bound.
The best data structure you can choose for these operations is a heap: https://www.tutorialspoint.com/python_data_structure/python_heaps.htm#:~:text=Heap%20is%20a%20special%20tree,is%20called%20a%20max%20heap.
With this data structure, both adding an element and removing the max are O(log(n)).
This is the most commonly used data structure when you need many operations on the max element; for example, it is commonly used to implement priority queues.
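Since Python's heapq module implements a min-heap, a common trick (sketched here as an illustration, not part of the original answer) is to store negated values to get max-heap behaviour:

import heapq

heap = []

def add(value):
    # O(log n): push the negated value so that the smallest stored item
    # corresponds to the largest original value.
    heapq.heappush(heap, -value)

def remove_largest():
    # O(log n): pop the smallest stored item and negate it back.
    return -heapq.heappop(heap)

for v in [3.5, 1.0, 9.25, 4.0]:
    add(v)
print(remove_largest())  # 9.25
print(remove_largest())  # 4.0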
Although constant time may be impossible, depending on your input constraints you might consider a y-fast trie, which has O(log log m) time operations and O(n) space, where m is the range of values; it works with integers, taking advantage of their bit structure. One of the supported operations is finding the next higher or lower element, which lets you keep track of the new maximum when the current one is removed.

Big O Notation of a module nested in a built-in method in Python

I'm wondering about a rationale for determining the big O values behind built-in methods in Python. Given the following operation:
i_arr = ['1','2','5','3','4','5','5']
s_arr = sorted(set(i_arr))
My rationale was that sorted(set(...)) = n^2, i.e. two nested loops.
Based on advice, I dug into the higher-level modules behind my original function in the Python source code:
def sorted(lst):
    l = list(lst)
    l.sort()
    return l

def sort(self):
    sortable_files = sorted(map(os.path.split, self.files))

def set(self):
    if not self._value:
        self._value = True
        for fut in self._waiters:
            if not fut.done():
                fut.set_result(True)
Now I'm really confused :)
The upper bound on the complexity of a good sorting algorithm is generally O(n*log(n)), as can be seen on Wikipedia.
Then it depends on how long it takes to convert the list to a set. From the example you posted, we can see that the list is iterated over and, at each step, the value is checked for membership in the set.
According to this, checking whether an element exists in a set has constant complexity, O(1), so constructing the whole set has complexity O(n) because we iterate over all elements.
Assuming it is implemented by adding each element one by one, we must iterate over the list to transform it into a set, which is O(n), and then we sort it, which is O(n*log(n)); that gives a total complexity of O(n*log(n) + n) = O(n*log(n)).
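To make that decomposition explicit, the one-liner from the question can be split into its two stages, which is where the O(n) and O(n*log(n)) parts come from:

i_arr = ['1', '2', '5', '3', '4', '5', '5']

unique = set(i_arr)     # O(n): one pass over the list, O(1) insert per element
s_arr = sorted(unique)  # O(m*log(m)), where m = len(unique) <= n
print(s_arr)            # ['1', '2', '3', '4', '5']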
Update:
Mapping also has linear complexity, because it iterates over the whole sequence and maps each element; but since a linear function grows more slowly than n*log(n), a linear operation does not affect the O(n*log(n)) term, so the asymptotic complexity remains the same even with the mapping.
Beyond the generic sorting-performance answers: since you mention sorted(), I assume you are asking about the big O behaviour of that built-in function in Python.
Well, Python uses Timsort; quoting from Wikipedia:
Timsort is a hybrid stable sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data.
In the worst case, Timsort takes O(n log n) comparisons to sort an array of n elements. In the best case, which occurs when the input is already sorted, it runs in linear time, meaning that it is an adaptive sorting algorithm.
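A quick, informal way to observe that adaptive behaviour is to time sorted() on already-sorted versus shuffled input (input size and repetition count are arbitrary, and exact timings will vary by machine):

import random
import timeit

sorted_input = list(range(100_000))
shuffled_input = sorted_input[:]
random.shuffle(shuffled_input)

print("already sorted:", timeit.timeit(lambda: sorted(sorted_input), number=100))
print("shuffled:      ", timeit.timeit(lambda: sorted(shuffled_input), number=100))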
O(N*log(N)) is the lower bound for comparison-based sorting algorithms such as quicksort, heapsort and mergesort; they are generic, comparison-based algorithms. In the worst case, quicksort can degrade to O(N*N).
If the elements are going to be numeric, there are O(N) upper bound algorithms available like radix sort, counting sort and bucket sort.
If we are talking about the Python library's sorted implementation, it uses Timsort. Timsort is an optimised mergesort variant; it is stable and faster than a regular mergesort on real-world data. It runs in O(N*log(N)) in the worst case and O(N) in the best case (when the list is already sorted).

Why bisect slower than sort

I know that bisect uses binary search to keep lists sorted. However, I did a timing test in which the values are read and then sorted. Contrary to my expectation, collecting the values and then sorting them wins by a large margin. Could more experienced users please explain this behaviour? Here is the code I used to test the timings.
import timeit
setup = """
import random
import bisect
a = range(100000)
random.shuffle(a)
"""
p1 = """
b = []
for i in a:
    b.append(i)
b.sort()
"""
p2 = """
b = []
for i in a:
    bisect.insort(b, i)
"""
print timeit.timeit(p1, setup=setup, number = 1)
print timeit.timeit(p2, setup=setup, number = 1)
# 0.0593081859178
# 1.69218442959
# Huge difference! Roughly 28x faster.
In the first version I take the values one by one instead of just sorting a, to mimic reading from a file. And it beats bisect by a wide margin.
Sorting a list takes about O(N*log(N)) time. Appending N items to a list takes O(N) time. Doing these things consecutively takes about O(N*log(N)) time.
Bisecting a list takes O(log(N)) time. Inserting an item into a list takes O(N) time. Doing both N times inside a for loop takes O(N * (N + log(N))) == O(N^2) time.
O(N^2) is worse than O(N*log(N)), so your p1 is faster than your p2.
Your algorithmic complexity will be worse in the bisect case ...
In the bisect case, you have N operations (each at an average cost of log(N) to find the insertion point and then an additional O(N) step to insert the item). Total complexity: O(N^2).
With sort, you have a single O(N*log(N)) step (plus N amortised O(1) appends to build the list in the first place). Total complexity: O(N*log(N)).
Also note that sort is implemented in very heavily optimized C code (bisect isn't quite as optimized since it ends up calling various comparison functions much more frequently...)
To understand the time difference, let’s look at what you are actually doing there.
In your first example, you take an empty list, append items to it, and sort it at the end.
Appending to lists is really cheap: it has an amortized time complexity of O(1). It cannot be truly constant time because the underlying data structure, a simple array, eventually needs to be expanded as the list grows. This is done every so often, which causes a new array to be allocated and the data to be copied. That is a bit more expensive, but in general we still say this is O(1).
Next up comes the sorting. Python uses Timsort, which is very efficient; it is O(n log n) in the average and worst case. So overall we have amortised constant-time appends followed by an O(n log n) sort, which means the sorting is the only thing that matters here. In total, this is pretty simple and very fast.
The second example uses bisect.insort. This utilizes a list and binary search to ensure that the list is sorted at all times.
Essentially, on every insert, it will use binary search to find the correct location to insert the new value, and then shift all items correctly to make room at that index for the new value. Binary search is cheap, O(log n) on average, so this is not a problem. Shifting alone is also not that difficult. In the worst case, we need to move all items one index to the right, so we get O(n) (this is basically the insert operation on lists).
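To make that explicit, bisect.insort behaves roughly like the following sketch (a simplification for illustration, not the module's actual code): an O(log n) search followed by an O(n) insert.

import bisect

def insort_sketch(sorted_list, value):
    # O(log n): binary search for the insertion point that keeps the list sorted.
    index = bisect.bisect_right(sorted_list, value)
    # O(n) worst case: every element after `index` shifts one slot to the right.
    sorted_list.insert(index, value)

b = [1, 3, 7]
insort_sketch(b, 5)
print(b)  # [1, 3, 5, 7]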
So in total, we would get linear time at worst. However, we do this on every single iteration. So when inserting n elements, we have O(n) each time. This results in a quadratic complexity, O(n²). This is a problem, and will ultimately slow the whole thing down.
So what does this tell us? Inserting into a sorted list item by item to obtain a sorted result is not really performant. We can use the bisect module to keep an already-sorted list ordered when we only do a few operations, but when we actually have a batch of unsorted data, it is better to sort the data as a whole.
Insertion and deletion operations in a data structure can sometimes be surprisingly expensive, particularly if the distribution of incoming data values is random, whereas sorting can be unexpectedly fast.
A key consideration is whether or not you can accumulate all the values, sort them once, and then use the sorted result all at once. If you can, sorting is almost always very noticeably faster.
If you remember the old sci-fi movies (back when computers were called "giant brains" and a movie always had spinning tape-drives), that's the sort of processing that they were supposedly doing: applying sorted updates to also-sorted master tapes, to produce a new still-sorted master. Random-access was not needed. (Which was a good thing, because at that time we really couldn't do it.) It is still an efficient way to process vast amounts of data.

Time differences when popping out items from dictionary of different lengths

I am designing software in Python and I got a little curious about whether there is any time difference between popping items from a very small dictionary and popping items from a very large one, or whether it is the same in all cases.
You can easily answer this question for yourself using the timeit module. But the entire point of a dictionary is near-instant access to any desired element by key, so I would not expect a large difference between the two scenarios.
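For example, a rough measurement along these lines (dictionary sizes chosen arbitrarily, and pop_and_restore is just an illustrative helper) lets you compare the two cases directly:

import timeit

small = {i: i for i in range(10)}
large = {i: i for i in range(1_000_000)}

def pop_and_restore(d, key):
    # Pop the key, then put it back so the dictionary keeps its size.
    value = d.pop(key)
    d[key] = value

print("small dict:", timeit.timeit(lambda: pop_and_restore(small, 5), number=100_000))
print("large dict:", timeit.timeit(lambda: pop_and_restore(large, 5), number=100_000))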
Check out this article on Python TimeComplexity:
The Average Case times listed for dict objects assume that the hash function for the objects is sufficiently robust to make collisions uncommon. The Average Case assumes the keys used in parameters are selected uniformly at random from the set of all keys.
Note that there is a fast-path for dicts that (in practice) only deal with str keys; this doesn't affect the algorithmic complexity, but it can significantly affect the constant factors: how quickly a typical program finishes.
According to this article, for a 'Get Item' operation the average case is O(1), with a worst case of O(n). In other words, the worst case is that the time increases linearly with size. See Big O Notation on Wikipedia for more information.
