Insert and get max from structure in constant time - python

I need a data structure to store positive (not necessarily integer) values. It must support the following two operations in sublinear time:
Add an element.
Remove the largest element.
Also, the largest key may scale as N^2, where N is the number of elements. In principle an O(N^2) space requirement wouldn't be a big problem, but if a more space-efficient option exists, it would work better.
I am working in Python, so if such a data structure exists, it would be of help to have an implementation in this language.

There is no such data structure. For example, if there were, sorting would be worst-case linear time: add all N elements in O(N) time, then remove the largest remaining element N times, again in O(N) total time, which would contradict the known Ω(N log N) lower bound for comparison-based sorting.

The best data structure you can choose for these operations is the heap: https://www.tutorialspoint.com/python_data_structure/python_heaps.htm#:~:text=Heap%20is%20a%20special%20tree,is%20called%20a%20max%20heap.
With this data structure, both adding an element and removing the max are O(log(n)).
This is the most commonly used data structure when you need many operations on the max element; for example, it is commonly used to implement priority queues.
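As a minimal sketch of that approach (not part of the original answer): Python's heapq module implements a min-heap, so the usual trick for removing the largest element is to store negated values.

import heapq

heap = []

def add(value):
    heapq.heappush(heap, -value)   # O(log n); negate so the largest value sits on top

def pop_max():
    return -heapq.heappop(heap)    # O(log n); undo the negation on the way out

for v in [3.5, 12.0, 7.25]:
    add(v)
print(pop_max())  # 12.0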

Although constant time may be impossible, depending on your input constraints you might consider a y-fast trie, which has O(log log m) time operations and O(n) space, where m is the range of the keys; note that they work with integers, taking advantage of their bit structure. One of the supported operations is finding the next higher or lower element, which lets you keep track of the new maximum once the current one is removed.
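There is no y-fast trie in the standard library, but the same idea of stepping to the next lower element when the maximum is removed can be sketched with the third-party sortedcontainers package (a different structure, shown here only as a practical stand-in):

from sortedcontainers import SortedList  # pip install sortedcontainers

s = SortedList([4.5, 19.0, 7.2])
largest = s[-1]      # current maximum: 19.0
s.remove(largest)    # O(log n)-ish removal
print(s[-1])         # 7.2 -- the next lower element becomes the new maximum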

Related

Python data structures - sorting big O complexity implementation

We have all been told that the theoretical limit for general-purpose, comparison-based sorting of objects is O(n*log(n)), in many languages.
Let's say we have a list:
lst = [1,1,2,3,4,5,3,2,3,4,2,1,2,3]
In Python, I was recently introduced to some additional benefits of using Counter (from collections import Counter) over dictionary with keys as list numbers and values as their occurrence counter.
coun = Counter(lst)
print(coun) # ==> Counter({2: 4, 3: 4, 1: 3, 4: 2, 5: 1})
It was shown several times (What is the time complexity of collections.Counter() in Python?) that construction of Counter takes O(n) and unlike standard dict, Counter() has some additional space overhead to store frequencies of each element.
When you work with Counter, it often seems to return its outputs in sorted order: .items() or .keys(). Maybe for the sake of convenience it applies a quick O(n*log(n)) sort before giving you the result, but that sounds unexpectedly bad when you use it in a simple traversal:
for i in range(len(lst)):
    if lst[i] not in coun.keys():
        print("element", lst[i], "not found!")
You would naturally expect the complexity of the above to be O(n), as with a standard dictionary (a presence check is O(1), repeated over n loop iterations).
So without peeking into the code, let's just assume that lst[i] not in coun.keys() is implemented with O(1) complexity, using some space overhead.
Is it theoretically possible that during Counter construction, this additional space overhead (potentially prohibitively large for really big lists with many unique values) gives us an edge on small and medium sized lists (length < 1000): an O(n) sorting advantage at the cost of extra space?
If that is possible, I assume that behind the scenes there is a mechanism that stops counting every single element and placing it into the correct sorted position once the memory footprint exceeds some defined value (like 1 MB), at which point lst[i] not in coun.keys() becomes O(logn).
Just thinking out loud here, as in reality a lot of the lists we work with are actually shorter than 1000 elements.
Afterthought 1:
On the other hand, you probably wouldn't care much about O(n) vs O(nlogn) when n < 1000; the time gain would be barely noticeable at a potentially huge space overhead price.
Afterthought 2:
It appears that .keys() preserves the insertion order, which just happened to be the same as the sorted order due to my poorly chosen initial data set.
Nevertheless, is it possible to implement a data structure that places the counted objects in their correct sorted positions at the moment they are added?
The O(n*log(n)) lower bound on sorting algorithms only applies to algorithms that can sort arbitrary objects by comparing them to one another. If you know that your data is from a limited domain, you can use more efficient algorithms. For example, if the values are all small integers you can use a counting sort to efficiently sort the data in O(n) time.
Here's an example that can sort sequences that only contain integers from the domain 0-5, like in your example.
def sort_0_to_5(data):
    counts = [0, 0, 0, 0, 0, 0]
    for val in data:
        counts[val] += 1
    return [val for val in range(len(counts)) for _ in range(counts[val])]
This runs in O(n) time and uses only constant space. This is a very basic counting sort, a fancier version could sort arbitrary objects as long as they have integer keys within the domain. (You just need a couple extra passes over the data to make cumulative counts and then to build up the output in the right order.)
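Here is a sketch of that fancier version, assuming each object exposes an integer key in the range 0..k-1 through a key function (the function name and arguments are illustrative):

def counting_sort_by_key(items, key, k):
    # count how many items map to each key value
    counts = [0] * k
    for item in items:
        counts[key(item)] += 1
    # turn the counts into starting positions (cumulative counts)
    total = 0
    for i in range(k):
        counts[i], total = total, total + counts[i]
    # place each item directly into its output slot, keeping the sort stable
    out = [None] * len(items)
    for item in items:
        out[counts[key(item)]] = item
        counts[key(item)] += 1
    return out

print(counting_sort_by_key(["bb", "a", "ccc", "dd"], key=len, k=4))
# ['a', 'bb', 'dd', 'ccc']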
More sophisticated algorithms like radix sort can handle much larger domains in quasi-linear time. Accounting for the time gets tricky, though: once the domain becomes comparable to the size of the data set, the parts of the code that deal with the domain size stop being effectively "constant". Radix sort, for example, takes O(n*log(k)) time, where k is the size of the domain.
I'd note, however, that even if you can find a sorting algorithm with a better time complexity than the standard comparison sorts, that may not actually mean it is faster for your actual data. Unless your data set is huge, the constant factors excluded from asymptotic analysis are likely to matter quite a lot. You may find that a very well implemented O(n*log(n)) sort (like the one behind Python's sorted) performs better than an O(n) sort you've coded up by hand.
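If you want to check that claim on your own data, a quick timeit comparison along these lines works (assuming the sort_0_to_5 function above is in scope; the sizes here are only illustrative):

import random, timeit

data = [random.randrange(6) for _ in range(100_000)]
print(timeit.timeit(lambda: sorted(data), number=50))
print(timeit.timeit(lambda: sort_0_to_5(data), number=50))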

Hash computation vs bucket walkthrough

I have a nested R-tree-like data structure in Python (a list of lists). The key is a large number (about 10 digits). On each level there are about x items (e.g. 10) in the list. Within each of those lists it recurses again with x items, and so on; the height of the tree is h levels (e.g. 5). Each level also records which range of keys it contains (like an R-tree).
For a given key, I need to locate the corresponding entry in the tree. This can be done trivially by scanning through each level and checking whether the given key lies within a child's range; if so, step into that child and recurse until reaching the leaf.
It can also be done by successively dividing the key by x and taking the quotient as the list index.
So the question is, which is more efficient: walking through the lists sequentially (complexity = depth * x, e.g. 50 range checks) or successively dividing the large number by x to get the actual list indices (complexity = h divisions, e.g. 5 divisions)?
That is, 50 range checks or 5 divisions?
This needs to be scalable. So if this code is being accessed in the cloud by a very large number of users, which is more efficient? Maybe division is more expensive to perform at scale than range checks?
You need to benchmark the code in a somewhat realistic scenario.
The reason it's so hard to say is that you are not just comparing division (and by the way, modern compilers avoid divisions with a large number of tricks). On modern CPUs you have large caches, so the list will likely fit into L2 or L3, which decreases the run time dramatically. There are also fancy vector/SIMD instructions that might be used to speed up all the checks in the linear case.
I would guess that going through the list sequentially will be faster; in addition, the code will be simpler.
But don't take my word for it: take a real example, benchmark the two versions, and pick based on the results, especially if this is critical for your system's performance.
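A possible shape for such a benchmark, assuming a toy tree with branching factor x = 10 and height h = 5 where each node is a list of (low, high, child) tuples; all names and the tree layout here are illustrative, not the asker's actual code:

import timeit

X, H = 10, 5               # assumed branching factor and tree height

def build(level, lo):
    # each node is a list of (low, high, child) tuples; leaves store the key itself
    if level == 0:
        return lo
    step = X ** (level - 1)
    return [(lo + i * step, lo + (i + 1) * step, build(level - 1, lo + i * step))
            for i in range(X)]

tree = build(H, 0)

def lookup_scan(key):
    node = tree
    for _ in range(H):                    # walk down, scanning ranges at each level
        for low, high, child in node:
            if low <= key < high:
                node = child
                break
    return node

def lookup_divide(key):
    node = tree
    for level in range(H - 1, -1, -1):    # derive each index from the key directly
        node = node[(key // X ** level) % X][2]
    return node

assert lookup_scan(54321) == lookup_divide(54321) == 54321
print(timeit.timeit("lookup_scan(54321)", globals=globals(), number=100000))
print(timeit.timeit("lookup_divide(54321)", globals=globals(), number=100000))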

Is my understanding of Hashsets correct? (Python)

I'm teaching myself data structures through this Python book, and I'd appreciate it if someone could correct me if I'm wrong, since a hash set seems to be extremely similar to a hash map.
Implementation:
A hash set is a list [] or array where each index points to the head of a linked list.
So hash(some_item) --> key, then list[key], and then add to the head of that linked list. This occurs in O(1) time.
When removing a value from the linked list, in Python we replace it with a placeholder because hash sets are not allowed to have Null/None values, correct?
When the list[] gets over a certain % of load/fullness, we copy it over to another, larger list.
Regarding Time Complexity Confusion:
So one question is: why is the average search/access O(1) if there can be a list of N items in the linked list at a given index?
Wouldn't the average case be that the search item is in the middle of its indexed linked list, so it should be O(n/2) -> O(n)?
Also, when removing an item, if we replace it with a placeholder value, isn't this considered a waste of memory if the placeholder is never used?
And finally, what is the difference between this and a HashMap, other than that HashMaps can have nulls and that HashMaps are key/value while hash sets are just values?
For your first question - why is the average time complexity of a lookup O(1)? - this statement is in general only true if you have a good hash function. An ideal hash function is one that causes a nice spread on its elements. In particular, hash functions are usually chosen so that the probability that any two elements collide is low. Under this assumption, it's possible to formally prove that the expected number of elements to check is O(1). If you search online for "universal family of hash functions," you'll probably find some good proofs of this result.
As for using placeholders - there are several different ways to implement a hash table. The approach you're using is called "closed addressing" or "hashing with chaining," and in that approach there's little reason to use placeholders. However, other hashing strategies exist as well. One common family of approaches is called "open addressing" (the most famous of which is linear probing hashing), and in those setups placeholder elements are necessary to avoid false negative lookups. Searching online for more details on this will likely give you a good explanation about why.
As for how this differs from HashMap, the HashMap is just one possible implementation of a map abstraction backed by a hash table. Java's HashMap does support nulls, while other approaches don't.
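To make the open-addressing point concrete, here is a toy linear-probing set with tombstones (purely illustrative; without the placeholder, the membership test for 21 below would stop at the emptied slot and give a false negative):

EMPTY, TOMBSTONE = object(), object()

class LinearProbingSet:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, value):
        i = hash(value) % len(self.slots)
        while True:                        # walk the probe sequence
            yield i
            i = (i + 1) % len(self.slots)

    def add(self, value):
        for i in self._probe(value):
            if self.slots[i] is EMPTY or self.slots[i] is TOMBSTONE:
                self.slots[i] = value
                return

    def remove(self, value):
        for i in self._probe(value):
            if self.slots[i] is EMPTY:
                return
            if self.slots[i] == value:
                self.slots[i] = TOMBSTONE  # placeholder keeps the probe chain intact
                return

    def __contains__(self, value):
        for i in self._probe(value):
            if self.slots[i] is EMPTY:     # only a truly empty slot ends the search
                return False
            if self.slots[i] == value:
                return True

s = LinearProbingSet()
for v in (5, 13, 21):    # all three hash to slot 5 in a table of size 8
    s.add(v)
s.remove(13)
print(21 in s)           # True, because the tombstone lets probing continue past slot 6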
The lookup time wouldn't be O(n) because not all items need to be searched, it also depends on the number of buckets. More buckets would decrease the probability of a collision and reduce the chain length.
The number of buckets can be kept as a constant factor of the number of entries by resizing the hash table as needed. Along with a hash function that evenly distributes the values, this keeps the expected chain length bounded, giving constant time lookups.
The hash tables used by hashmaps and hashsets are the same except they store different values. A hashset will contain references to a single value, and a hashmap will contain references to a key and a value. Hashsets can be implemented by delegating to a hashmap where the keys and values are the same.
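That last point is easy to make concrete; a minimal sketch of a set that delegates to Python's dict (itself a hash table):

class DictBackedSet:
    # stores each value as its own key, so lookups reuse the dict's hashing
    def __init__(self):
        self._map = {}

    def add(self, value):
        self._map[value] = value

    def discard(self, value):
        self._map.pop(value, None)

    def __contains__(self, value):
        return value in self._map      # average O(1)

s = DictBackedSet()
s.add(42)
print(42 in s, 7 in s)   # True False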
A lot has been written here about open hash tables, but some fundamental points are missed.
Practical implementations generally have O(1) lookup and delete because they guarantee buckets won't contain more than a fixed number of items (the load factor). But this means they can only achieve amortized O(1) time for insert because the table needs to be reorganized periodically as it grows.
(Some may opt to reorganize on delete as well, to shrink the table when the load factor reaches some bottom threshold, but this only affects space, not asymptotic run time.)
Reorganization means increasing (or decreasing) the number of buckets and re-assigning all elements into their new bucket locations. There are schemes, e.g. extensible hashing, to make this a bit cheaper. But in general it means touching each element in the table.
Reorganization, then, is O(n). How can insert be O(1) when any given one may incur this cost? The secret is amortization and the power of powers. When the table is grown, it must be grown by a factor greater than one, two being most common. If the table starts with 1 bucket and doubles each time the load factor reaches F, then the cost of N reorganizations is
F + 2F + 4F + 8F ... (2^(N-1))F = (2^N - 1)F
At this point the table contains (2^(N-1))F elements, the number in the table during the last reorganization. I.e. we have done (2^(N-1))F inserts, and the total cost of reorganization is as shown on the right. The interesting part is the average cost per element in the table (or insert, take your pick):
(2^N - 1)F / ((2^(N-1))F) ≈ 2^N / 2^(N-1) = 2
That's where the amortized O(1) comes from.
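You can sanity-check that figure with a short simulation (F here is an assumed load-factor threshold; the exact value doesn't matter for the conclusion):

F = 4
n = moved = 0
capacity = 1
while n < 1_000_000:
    n += 1                               # one insert
    if n > capacity * F:                 # load factor exceeded: double and rehash
        capacity *= 2
        moved += n                       # every element gets re-bucketed
        print(n, round(moved / n, 3))    # cost per insert climbs toward ~2 and stays there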
One additional point is that for modern processors, linked lists aren't a great idea for the bucket lists. With 8-byte pointers, the overhead is meaningful. More importantly, heap-allocated nodes in a single list will almost never be contiguous in memory. Traversing such a list kills cache performance, which can slow things down by orders of magnitude.
Arrays (with an integer count for number of data-containing elements) are likely to work out better. If the load factor is small enough, just allocate an array equal in size to the load factor at the time the first element is inserted in the bucket. Otherwise, grow these element arrays by factors the same way as the bucket array! Everything will still amortize to O(1).
To delete an item from such a bucket, don't mark it deleted. Just copy the last array element to the location of the deleted one and decrement the element count. Of course this won't work if you allow external pointers into the hash buckets, but that's a bad idea anyway.
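That swap-with-last deletion looks like this on a plain Python list used as a bucket (the names here are illustrative):

def bucket_remove(bucket, count, index):
    # overwrite the deleted slot with the last live element and shrink the live count
    bucket[index] = bucket[count - 1]
    bucket[count - 1] = None      # optional: drop the now-duplicate reference
    return count - 1              # caller keeps this as the new element count

bucket = [10, 20, 30, 40]
count = bucket_remove(bucket, 4, 1)   # delete the 20
print(bucket[:count])                 # [10, 40, 30] -- order changes, no placeholder needed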

Why is the following algorithm O(1) space?

Input: A list of positive integers where one entry occurs exactly once, and all other entries occur exactly twice (for example [1,3,2,5,3,4,1,2,4])
Output: The unique entry (5 in the above example)
The following algorithm is supposed to be O(m) time and O(1) space where m is the size of the list.
def get_unique(intlist):
    unique_val = 0
    for int in intlist:
        unique_val ^= int
    return unique_val
My analysis: Given a list of length m, there will be (m + 1)/2 distinct positive integers in the input list, so the smallest possible maximum integer in the list is (m+1)/2. If we assume this best case, then when taking the XOR sum the variable unique_val will require ceiling(log((m+1)/2)) bits of memory, so I thought the space complexity should be at least O(log(m)).
Your analysis is certainly one correct answer, particularly in a language like Python which gracefully handles arbitrarily large numbers.
It's important to be clear about what you're trying to measure when thinking about space and time complexity. A reasonable assumption might be that the size of an integer is constant (e.g. you're using 64-bit integers). In that case, the space complexity is certainly O(1), but the time complexity is still O(m).
Now, you could also argue that using a fixed-size integer means you have a constant upper-bound on the size of m, so perhaps the time complexity is also O(1). But in most cases where you need to analyze the running time of this sort of algorithm, you're probably very interested in the difference between a list of length 10 and one of length 1 billion.
I'd say it's important to clarify and state your assumptions when analyzing space- and time-complexity. In this case, I would assume we have a fixed size integer and a value of m much smaller than the maximum integer value. In that case, O(1) space and O(m) time are probably the best answers.
EDIT (based on discussion in other answers)
Since all m gives you is a lower bound on the maximum value in the list, you really can't provide a worst-case estimate of the space; i.e. a number in the list can be arbitrarily large. To have any reasonable answer as to the space complexity of this algorithm, you need to make some assumption about the maximum size of the input values.
The (space/time) complexity analysis is usually applied to algorithms on a higher level. While you can drop down to specific language implementation level, it may not be useful in all cases.
Your analysis is both right and possibly wrong. It's right for the current CPython implementation, where integers do not have a maximum value. The O(1) claim is only ok if all your integers are relatively small and fit into the implementation-specific small-number case.
But it doesn't have to be valid for all other implementations of Python. For example, you could have an optimizing implementation which figures out that intlist is not used again and, instead of using unique_val, reuses the space of the consumed list elements (basically transforming this function into a space-optimized reduce call).
Then again, can we even talk about space complexity in a GC'd language with heap-allocated integers? Your analysis of the complexity is wrong, because a ^= b will allocate new memory for a big value b, and the size of that allocation depends on the system, architecture, Python version, and luck.
Your original question is however "Why is the following algorithm O(1) space?". If you look at the algorithm itself and assume you have some arbitrary maximum integer limits, or your language can represent any number in a limited space, then the answer is yes. The algorithm itself with those conditions uses constant space.
The complexity of an algorithm is always dependent on the machine model (= platform) you use. E.g. we often say that multiplying and dividing IEEE floating point numbers is of run-time complexity O(1) - which is not always the case (e.g. on an 8086 processor without FPU).
For the above algorithm, the space complexity of O(1) only holds as long as your input list has no element > 2147483647 (= sys.maxint on a 32-bit Python 2 build). Up to that limit, Python 2 stores plain integers as machine-word signed values. For those, your processor has all relevant operations implemented in hardware, and it generally takes only a constant number of clock cycles (in most cases only one) to perform them (= run-time complexity O(1)) and only a constant number of memory words (only one) to store the result (= space complexity O(1)).
However, if your input exceeds 2147483647, Python switches to a software-implemented arbitrary-precision type to store these big integers (Python 3 uses this representation for all integers). Operations on these are no longer O(1), and they require more than constant O(1) space.
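You can observe this directly in CPython; exact byte counts vary by version and build, so treat the numbers as illustrative:

import sys
print(sys.getsizeof(1))          # small int, e.g. 28 bytes on 64-bit CPython 3
print(sys.getsizeof(2**31))      # still tiny
print(sys.getsizeof(2**10000))   # thousands of bytes: storage grows with the value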

find minimal and maximum element in Python

Suppose I am maintaining a set of integers (elements will be added/removed dynamically, and the integers might have duplicate values), and I need to efficiently find the max/min element of the current element set. Wondering if there are any better solutions?
My current solution is to maintain a max-heap and a min-heap. I am using Python 2.7.x and am open to any 3rd party pip packages that fit my problem.
Just use the min and max functions. There is no point in maintaining a heap unless you want to perform this action (getting min/max) many times while adding and removing elements.
Do not forget that to build a heap you need to spend O(n) time, where the constant is 2 (as far as I remember). Only then can you use its O(log(n)) operations to get your min/max.
P.S. OK, now that you have said you have to call min/max many times, you should make two heaps (a min-heap and a max-heap). It will take O(n) to construct each heap; then each operation (add element, remove element, find min/max) will take O(log(n)), where you add/remove elements in both heaps and do min/max on the corresponding heap.
Whereas if you go with the min/max functions over a plain list, you do not need to construct anything; add will take O(1), remove O(n), and min/max O(n), which is far worse than the heap.
P.P.S. Python has heaps built in (the heapq module).
min(list_of_ints)
Will yield your minimum of the list and...
max(list_of_ints)
Will yield your maximum of the list.
Hope that helps....
Using a sorted list may help. The first element will always be the minimum and the last element will always be the maximum, with O(1) access to both values.
Adding/removing can use binary search to find the position in O(log(n)); keep in mind that inserting into or deleting from a Python list still shifts elements, so the list operation itself is O(n) in the worst case.
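A sketch of that approach using the third-party sortedcontainers package (which fits the question's openness to pip packages; SortedList keeps the elements ordered and allows duplicates):

from sortedcontainers import SortedList  # pip install sortedcontainers

s = SortedList([5, 1, 9, 1])
s.add(7)             # insertion keeps the list sorted
s.remove(1)          # removes one copy of a duplicate value
print(s[0], s[-1])   # 1 9 -- current min and max in O(1)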
