Using a heap for big disk sorts - Python

On the official Python docs here, it is mentioned that:
Heaps are also very useful in big disk sorts. You most probably all know that a big sort implies producing “runs” (which are pre-sorted sequences, whose size is usually related to the amount of CPU memory), followed by a merging passes for these runs, which merging is often very cleverly organised.
It is very important that the initial sort produces the longest runs possible. Tournaments are a good way to achieve that. If, using all the memory available to hold a tournament, you replace and percolate items that happen to fit the current run, you’ll produce runs which are twice the size of the memory for random input, and much better for input fuzzily ordered.
Moreover, if you output the 0’th item on disk and get an input which may not fit in the current tournament (because the value “wins” over the last output value), it cannot fit in the heap, so the size of the heap decreases. The freed memory could be cleverly reused immediately for progressively building a second heap, which grows at exactly the same rate the first heap is melting.
When the first heap completely vanishes, you switch heaps and start a new run. Clever and quite effective!
I am aware of an algorithm called External sorting in which we:
Break down the input into smaller chunks.
Sort all the chunks individually and write them back to disk one-by-one.
Create a heap and do a k-way merge among all the sorted chunks (a rough sketch of this merge step follows the list).
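For reference, that merge step can be sketched with heapq.merge, which performs exactly this kind of heap-based k-way merge over already-sorted iterables; the file layout assumed here (one integer per line per run file) and the function name merge_runs are just illustrative assumptions.
import heapq

def merge_runs(run_paths, out_path):
    # Open every sorted run file; heapq.merge lazily k-way merges them using a heap.
    files = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as out:
            for value in heapq.merge(*(map(int, f) for f in files)):
                out.write(f"{value}\n")
    finally:
        for f in files:
            f.close()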
I completely understood external sorting as described on Wikipedia, but am not able to understand the author when they say:
If, using all the memory available to hold a tournament, you replace and percolate items that happen to fit the current run, you’ll produce runs which are twice the size of the memory for random input, and much better for input fuzzily ordered.
and:
Moreover, if you output the 0’th item on disk and get an input which may not fit in the current tournament (because the value “wins” over the last output value), it cannot fit in the heap, so the size of the heap decreases.
What is this heap melting?

"Heap melting" is not a standard term. It's just the word the author uses for the heap getting smaller as you pull out its smallest items.
The idea he's talking about is a clever replacement for the "divide the input into chunks and sort the chunks" part of the external sort. It produces larger sorted chunks.
The idea is that you first read the biggest chunk you can into memory and arrange it into a heap, then you start writing out the smallest elements from the heap as you read new elements in.
When you read in an element that is smaller than an element you have already written out, it can't go in the current chunk (it would ruin the sort), so you remember it for the next chunk. Elements that are not smaller than the last one you wrote out can be inserted into the heap. They will make it out into the current chunk, making the current chunk larger.
Eventually your heap will be empty. At that point you're done with the current chunk -- heapify all the elements you remembered and start writing out the next chunk.
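Here is a minimal sketch of that replacement-selection idea, assuming the input is any iterable of comparable items and that each run is simply collected into a list (in a real disk sort each run would be written out to a file as it is produced; the function name and the memory_size parameter are made up for illustration).
import heapq

def replacement_selection(items, memory_size):
    """Yield sorted runs; for random input each run averages about 2x memory_size."""
    _DONE = object()                       # sentinel marking an exhausted input
    it = iter(items)
    heap = [x for _, x in zip(range(memory_size), it)]
    heapq.heapify(heap)
    pending = []                           # items held back for the next run
    while heap:
        run = []
        while heap:
            smallest = heapq.heappop(heap)
            run.append(smallest)           # "output the 0'th item"
            nxt = next(it, _DONE)
            if nxt is _DONE:
                continue
            if nxt >= smallest:
                heapq.heappush(heap, nxt)  # still fits the current run
            else:
                pending.append(nxt)        # too small: the heap "melts"
        yield run
        heap, pending = pending, []        # switch heaps and start a new run
        heapq.heapify(heap)
For uniformly random input the runs this produces come out around twice memory_size long (except the last one), matching the claim quoted from the docs.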


What is the time complexity of pop() for the set in Python?

I know that popping the last element of a list takes O(1).
And after reading this post
What is the time complexity of popping elements from list in Python?
I see that popping an element from an arbitrary position takes O(n), since all the pointers after it need to shift one position.
But a set has no order and no index, so I am not sure whether the same kind of shifting applies.
If not, is pop() for a set O(1)?
Thanks.
On modern CPython implementations, pop takes amortized constant-ish time (I'll explain further). On Python 2, it's usually the same, but performance can degrade heavily in certain cases.
A Python set is based on a hash table, and pop has to find an occupied entry in the table to remove and return. If it searched from the start of the table every time, this would take time proportional to the number of empty leading entries, and it would get slower with every pop.
To avoid this, the standard CPython implementation tries to remember the position of the last popped entry, to speed up sequences of pops. CPython 3.5+ has a dedicated finger member in the set memory layout to store this position, but earlier versions abuse the hash field of the first hash table entry to store this index.
On any Python version, removing all elements from a set with a sequence of pop operations will take time proportional to the size of the underlying hash table, which is usually within a small constant factor of the original number of elements (unless you'd already removed a bunch of elements). Mixing insertions and pop can interfere heavily with this on Python 2 if inserted elements land in hash table index 0, trashing the search finger. This is much less of an issue on Python 3.
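As a rough, non-rigorous illustration at the Python level (timings are machine- and version-dependent), draining sets of increasing size with repeated pop calls should take total time roughly proportional to their size, i.e. amortized O(1) per pop:
import timeit

for n in (10_000, 100_000, 1_000_000):
    s = set(range(n))
    # n pops drain the set completely; the total time should scale roughly linearly in n.
    print(n, timeit.timeit(s.pop, number=n))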

Does this sentence contradict the Python paradigm "lists should not be initialized"?

People coming to Python from other languages often ask how they should pre-allocate or initialize their lists. This is especially true for people coming from Matlab, where code such as
l = []
for i = 1:100
    l(end+1) = 1;
end
produces a warning that explicitly suggests you preallocate (initialize) the array.
There are several posts on SO explaining (and showing through tests) that list initialization isn't required in python. A good example with a fair bit of discussion is this one (but the list could be very long): Create a list with initial capacity in Python
The other day, however, while looking up the complexity of list operations in Python, I stumbled upon this sentence on the official Python wiki:
the largest [cost for list operations] come from growing beyond the current allocation size (because everything must move),
This seems to suggest that indeed lists do have a pre-allocation size and that growing beyond that size cause the whole list to move.
This shook my foundations a bit. Can list pre-allocation reduce the overall complexity (in terms of number of operations) of the code? If not, what does that sentence mean?
EDIT:
Clearly my question regards the (very common) code:
container = ...  # some iterable with 1 gazillion elements
new_list = []
for x in container:
    ...  # do whatever you want with x
    new_list.append(x)  # or something computed using x
In this case the compiler cannot know how many items there are in container, so new_list could potentially require its allocated memory to change an incredible number of times, if what is said in that sentence is true.
I know that this is different for list-comprehensions
Can list pre-allocation reduce the overall complexity (in terms of number of operations) of the code?
No, the overall time complexity of the code will be the same, because the time cost of reallocating the list is O(1) when amortised over all of the operations which increase the size of the list.
If not, what does that sentence mean?
In principle, pre-allocating the list could reduce the running time by some constant factor, by avoiding multiple re-allocations. This doesn't mean the complexity is lower, but it may mean the code is faster in practice. If in doubt, benchmark or profile the relevant part of your code to compare the two options; in most circumstances it won't matter, and when it does, there are likely to be better alternatives anyway (e.g. NumPy arrays) for achieving the same goal.
new_list could potentially require its allocated memory to change an incredible number of times
List reallocation follows a geometric progression, so if the final length of the list is n then the list is reallocated only O(log n) times along the way; not an "incredible number of times". The way the maths works out, the average number of times each element gets copied to a new underlying array is a constant regardless of how large the list gets, hence the O(1) amortised cost of appending to the list.
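If you want to see the geometric over-allocation from Python code, one rough way is to count how often sys.getsizeof changes while appending; the exact growth pattern is a CPython implementation detail and varies between versions:
import sys

lst = []
last_size = sys.getsizeof(lst)
reallocations = 0
for i in range(1_000_000):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:   # the underlying array was reallocated and grew
        reallocations += 1
        last_size = size
print(reallocations)        # on the order of a hundred, not a million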

amortized analysis and one basic question (is there any simple example)?

I see two sentences:
total amortized cost of a sequence of operations must be an upper bound on the total actual cost of the sequence
When assigning amortized costs to operations on a data structure, you need to ensure that, for any sequence of operations performed, that the sum of the amortized costs is always at least as big as the sum of the actual costs of those operations.
My difficulty is twofold:
A) Do both of them mean: amortized cost >= real cost of the operation? I had thought the amortized cost was (n * real cost).
B) Is there an example that would make this clearer for me? A real and short example?
The problem that amortization solves is that common operations may trigger occasional slow ones. Therefore if we add up the worst cases, we are effectively looking at how the program would perform if garbage collection is always running and every data structure had to be moved in memory every time. But if we ignore the worst cases, we are effectively ignoring that garbage collection sometimes does run, and large lists sometimes do run out of allocated space and have to be moved to a bigger bucket.
We solve this by gradually writing off occasional big operations over time. We write them off as soon as we realize that they may be needed some day. Which means that the amortized cost is usually bigger than the real cost, because it includes that future work, but occasionally the real cost is way bigger than the amortized. And, on average, they come out to around the same.
The standard example people start with is a list implementation where we allocate 2x the space we currently need, and then reallocate and move it if we use up the space. When I run foo.append(...) in this implementation, usually I just insert. But occasionally I have to copy the whole large list. However, if I just copied a list with n items, then after I append n more times I will need to copy 2n items to a bigger space. Therefore my amortized estimate of what an append costs includes the cost of one insert plus the cost of moving 2 items. And over the next n times I call append, my estimate exceeds the real cost n-1 times and falls short the nth time, but it averages out exactly right.
(Python's real list implementation works like this except that the new list is around 9/8 the size of the old one.)
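Here is a toy version of that doubling scheme with a move counter, so the amortized claim can be checked; this only illustrates the analysis and is not how CPython actually grows its lists:
class DoublingList:
    """Toy dynamic array that doubles its capacity whenever it fills up."""
    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity
        self.moves = 0                     # elements copied during regrowth

    def append(self, value):
        if self.size == self.capacity:
            self.capacity *= 2
            new_data = [None] * self.capacity
            for i in range(self.size):     # the occasional expensive copy
                new_data[i] = self.data[i]
                self.moves += 1
            self.data = new_data
        self.data[self.size] = value
        self.size += 1

lst = DoublingList()
for i in range(1_000_000):
    lst.append(i)
print(lst.moves / lst.size)   # roughly 1, and never above 2: amortized O(1) per append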

Why is filling an array from the front slow?

In the chapter on Arrays in the book Elements of Programming Interviews in Python, it is mentioned that Filling an array from the front is slow, so see if it’s possible to write values from the back.
What could be the possible reason for that?
Python lists, at least in CPython (the standard Python implementation), are actually implemented, from a data structure perspective, as arrays, not linked lists.
However, these are dynamically allocated and resized, so appending to the end of a Python list is actually possible. It takes a somewhat variable amount of time: when a list grows, CPython allocates more space than is immediately necessary, so that it doesn't need to reallocate on every append operation. If space is already available, appending is O(1), and since it is an array, indexing is also O(1).
What will take a long time, however, is adding something to the beginning of a list, as this would require shifting all the array values, and is O(n), just as popping the first element is.
Python language designers have decided to call these arrays lists instead of arrays, contradicting standard terminology, in part, I assume, because the dynamic resizing makes them different from standard, fixed-size arrays.
Unless I'm mistaken, collections.deque is implemented as a doubly-linked list (of fixed-size blocks), with the corresponding O(1) appends/pops on either side, and so on.
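A rough comparison (illustrative, machine-dependent timings): building a list front-first with insert(0, ...), building it back-first with append, and building it front-first with deque.appendleft:
import timeit
from collections import deque

N = 50_000

def fill_front_list():
    lst = []
    for i in range(N):
        lst.insert(0, i)    # shifts every existing element: O(n) per insert

def fill_back_list():
    lst = []
    for i in range(N):
        lst.append(i)       # amortized O(1) per append

def fill_front_deque():
    d = deque()
    for i in range(N):
        d.appendleft(i)     # O(1) at either end of a deque

for fn in (fill_front_list, fill_back_list, fill_front_deque):
    print(fn.__name__, timeit.timeit(fn, number=1))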

Why is deque implemented as a linked list instead of a circular array?

CPython deque is implemented as a doubly-linked list of 64-item sized "blocks" (arrays). The blocks are all full, except for the ones at either end of the linked list. IIUC, the blocks are freed when a pop / popleft removes the last item in the block; they are allocated when append/appendleft attempts to add a new item and the relevant block is full.
I understand the listed advantages of using a linked list of blocks rather than a linked list of items:
reduce memory cost of pointers to prev and next in every item
reduce runtime cost of doing malloc/free for every item added/removed
improve cache locality by placing consecutive pointers next to each other
But why wasn't a single dynamically-sized circular array used instead of the doubly-linked list in the first place?
AFAICT, the circular array would preserve all the above advantages, and maintain the (amortized) cost of pop*/append* at O(1) (by overallocating, just like in list). In addition, it would improve the cost of lookup by index from the current O(n) to O(1). A circular array would also be simpler to implement, since it can reuse much of the list implementation.
I can see an argument in favor of a linked list in languages like C++, where removal of an item from the middle can be done in O(1) using a pointer or iterator; however, python deque has no API to do this.
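For concreteness, here is a toy sketch of the kind of circular-array deque being proposed (illustrative only; it omits shrinking, error handling, and the right-end pop):
class RingDeque:
    """Toy circular-array deque: amortized O(1) pushes at both ends, O(1) indexing."""
    def __init__(self):
        self._buf = [None] * 8
        self._head = 0                     # index of the first element
        self._size = 0

    def _grow(self):
        new = [None] * (2 * len(self._buf))
        for i in range(self._size):        # unwrap into the larger buffer
            new[i] = self._buf[(self._head + i) % len(self._buf)]
        self._buf, self._head = new, 0

    def append(self, x):
        if self._size == len(self._buf):
            self._grow()
        self._buf[(self._head + self._size) % len(self._buf)] = x
        self._size += 1

    def appendleft(self, x):
        if self._size == len(self._buf):
            self._grow()
        self._head = (self._head - 1) % len(self._buf)
        self._buf[self._head] = x
        self._size += 1

    def popleft(self):
        x, self._buf[self._head] = self._buf[self._head], None
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        return x

    def __len__(self):
        return self._size

    def __getitem__(self, i):              # O(1), unlike the block-list deque
        return self._buf[(self._head + i) % len(self._buf)]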
Adapted from my reply on the python-dev mailing list:
The primary point of a deque is to make popping and pushing at both ends efficient. That's what the current implementation does: worst-case constant time per push or pop regardless of how many items are in the deque. That beats "amortized O(1)" in the small and in the large. That's why it was done this way.
Some other deque methods are consequently slower than they are for lists, but who cares? For example, the only indices I've ever used with a deque are 0 and -1 (to peek at one end or the other of a deque), and the implementation makes accessing those specific indices constant-time too.
Indeed, the message from Raymond Hettinger referenced by Jim Fasarakis Hilliard in his comment:
https://www.mail-archive.com/python-dev@python.org/msg25024.html
confirms that
The only reason that __getitem__ was put in was to support fast access to the head and tail without actually popping the value
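That constant-time access to the ends (and the O(n) cost in the middle) is easy to see with a rough timing; the numbers are machine-dependent and purely illustrative:
import timeit
from collections import deque

d = deque(range(1_000_000))
mid = len(d) // 2

for expr in ("d[0]", "d[-1]", "d[mid]"):
    # d[0] and d[-1] are constant time; d[mid] has to walk the chain of blocks.
    print(expr, timeit.timeit(expr, globals={"d": d, "mid": mid}, number=10_000))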
In addition to accepting @TimPeters' answer, I wanted to add a couple of additional observations that don't fit into a comment format.
Let's focus on a common use case where a deque is used as a simple FIFO queue.
Once the queue reaches its peak size, the circular array needs no more allocations of memory from the heap. I thought of it as an advantage, but it turns out the CPython implementation achieves the same by keeping a list of reusable memory blocks. A tie.
While the queue size is growing, both the circular array and the CPython implementation need memory from the heap. CPython needs a simple malloc, while the array needs a (potentially much more expensive) realloc (unless extra space happens to be available right after the original memory block, realloc has to allocate new memory, copy the data over, and free the old block). Advantage to CPython.
If the queue peaked out at a much larger size than its stable size, both CPython and the array implementation would waste the unused memory (the former by saving it in a reusable block list, the latter by leaving the unused empty space in the array). A tie.
As @TimPeters pointed out, the cost of each FIFO queue put / get is always O(1) for CPython, but only amortized O(1) for the array. Advantage to CPython.
