I would like to understand Timsort. Including all details, so an answer like "just call .sort() in Java or in Python" is not what I'm looking for.
I've already read the Wikipedia article, the Java implementation, the CPython implementation, the listsort.txt that these sources mention, and the publication listsort.txt references: Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs. I've also skimmed through some of the references of Nearly-Optimal Mergesorts, namely Scaling and Related Techniques for Geometry Problems, A Unifying Look at Data Structures, and Fast Algorithms for Finding Nearest Common Ancestors; I wouldn't say I understood the latter three in full depth but I'm quite sure that they don't answer my question.
I grok most of the sub-algorithms: run counting, run reversal, binary insertion sort, merging (saving the head / tail part), and galloping. Please do not try to explain these; I know what they do, how they do it, and why.
What I'm missing is the merge pattern. Actually, I understand how it works by maintaining a stack of runs and deciding when to merge based on repeatedly checking invariants; I also see that the goals are adapting to natural runs, achieving sorting stability, balancing merges, exploiting temporal locality, and keeping the algorithm simple.
But I don't get why reestablishing these invariants results in an efficient algorithm.
The file listsort.txt states (line 346):
The code now uses the "powersort" merge strategy from: "Nearly-Optimal
Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to
Existing Runs" J. Ian Munro and Sebastian Wild
I understand how Munro and Wild's powersort works, and I find their explanation involving "nearly-optimal alphabetic trees" with a "bisection heuristic" and "movement along the edges of Cartesian trees" sufficient.
What I can't grasp is the link between powersort's and Timsort's merge pattern.
Clearly, they are different:
powersort considers node powers, while Timsort takes into account run lengths,
powersort has one invariant: B <= A, while Timsort has two: Y > X and Z > Y + X (where X is the most recently read run, which has the highest starting index in the array and is on the top or will be placed on the top of the stack, and A is the node between runs Y and X),
powersort always merges Y and X, while Timsort merges either Y and X or Z and Y.
I see that a higher node power is usually between shorter runs but I cannot deduce one algorithm from the other. Please explain to me what the connection between the two is.
Once I understand this, I hope I will also find the justification of the stack capacities:
/*
* Allocate runs-to-be-merged stack (which cannot be expanded). The
* stack length requirements are described in listsort.txt. The C
* version always uses the same stack length (85), but this was
* measured to be too expensive when sorting "mid-sized" arrays (e.g.,
* 100 elements) in Java. Therefore, we use smaller (but sufficiently
* large) stack lengths for smaller arrays. The "magic numbers" in the
* computation below must be changed if MIN_MERGE is decreased. See
* the MIN_MERGE declaration above for more information.
*/
int stackLen = (len < 120    ?  5 :
                len < 1542   ? 10 :
                len < 119151 ? 19 : 40);
How were these numbers calculated? Why is it sure that the stack won't grow beyond these? Why doesn't the code check if there is a free slot? If there isn't, then (a) expanding the stack or (b) merging at the top would easily prevent an ArrayIndexOutOfBoundsException; but I didn't find anything like that in the code.
Please explain to me what the connection between the two is.
They instantiate a common framework: split the input list into runs, merge adjacent runs until there is only one. The different ways to do this can be summarized by binary trees whose leaves are the runs. Timsort, powersort, and peeksort (from the same paper as powersort) represent three different tree construction methods.
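To make that concrete, here is a tiny sketch of the tree view (my own illustration, not code from any of the implementations you cite). A merge tree is either a run length (a leaf) or a pair of subtrees; the cost of a merge policy is the total number of elements moved, i.e. the sum of the merged lengths over all internal nodes. Timsort's stack invariants and powersort's node powers are just two different on-line rules for choosing the shape of this tree, and powersort's rule is the one with a near-optimality guarantee.
def merge_cost(tree):
    # tree is either an int (a run length, i.e. a leaf) or a (left, right) pair
    # representing the merge of its two subtrees; returns (elements under the
    # node, total cost of all merges at and below it)
    if isinstance(tree, int):
        return tree, 0
    left_len, left_cost = merge_cost(tree[0])
    right_len, right_cost = merge_cost(tree[1])
    length = left_len + right_len
    return length, left_cost + right_cost + length

# four natural runs of lengths 50, 3, 4, 45 and two possible merge trees
print(merge_cost(((50, 3), (4, 45)))[1])    # 204: short runs merged into their long neighbours first
print(merge_cost((((50, 3), 4), 45))[1])    # 212: always merge at the left end
The question then is only which rule gets closer to the cheapest tree for a given run profile, which is what the powersort paper analyses.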
How were these numbers calculated?
Wrongly. (In case the link dies, that's de Gouw, Rot, de Boer, Bubel, and Hähnle: "OpenJDK's java.utils.Collection.sort() is broken: The good, the bad and the worst case".)
Related
I want to determine the largest contiguous (if that’s the right word) graph consisting of a bunch of sub-graphs. I define two sub-graphs as being contiguous if any of the nodes between the two sub-graphs are linked.
My initial solution to this is very slow and cumbersome and stupid – just to look at each sub-graph, see if it’s linked to any of the other sub-graphs, and do the analysis for all of the sub-graphs to find the largest number of linked sub-graphs. That’s just me coming from a Fortran background. Is there a better way to do it – a pythonic way, even a graph theory way? I imagine this is a standard question in network science.
A good starting point to answer the kind of question you've asked is to look at a merge-find (or disjoint-set) approach (https://en.wikipedia.org/wiki/Disjoint-set_data_structure).
It offers an efficient algorithm (at least on an amortized basis) to identify which members of a collection of graphs are disjoint and which aren't.
Here are a couple of related questions that have pointers to additional resources about this algorithm (also known as "union-find"):
Union find implementation using Python
A set union find algorithm
You can get quite respectable performance by merging two sets using "union by rank" as summarized in the Wikipedia page (and the pseudocode provided therein):
For union by rank, a node stores its rank, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots x and y, first compare their ranks. If the ranks are different, then the larger rank tree becomes the parent, and the ranks of x and y do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a Find operation, so storing ranks avoids the extra effort of keeping the height correct.
I believe there may be even more sophisticated approaches, but the above union-by-rank implementation is what I have used in the past.
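For illustration, here is a minimal union-find sketch in Python following the union-by-rank description quoted above, with path compression added in find (the class, the node names, and the edge list are my own, made up for the example):
from collections import Counter

class DisjointSet:
    def __init__(self):
        self.parent = {}   # node -> parent node
        self.rank = {}     # root -> upper bound on the height of its tree

    def find(self, x):
        if x not in self.parent:       # first time we see x: a singleton set
            self.parent[x] = x
            self.rank[x] = 0
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:   # union by rank: taller tree becomes the parent
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

# feed it the links between sub-graphs, then group nodes by their root
edges = [("a", "b"), ("b", "c"), ("d", "e")]
ds = DisjointSet()
for u, v in edges:
    ds.union(u, v)
sizes = Counter(ds.find(node) for node in list(ds.parent))
print(sizes.most_common(1))    # root and size of the largest connected group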
I know the implementation of the edit distance algorithm. By dynamic programming, we first fill the first column and first row and then the entries immediately right and below of the filled entries by comparing three paths from the left, above, and left above. While for the Ratcliff/Obershelp algorithm, we first extract the longest common substring out from the two strings, then we do recursive operations for the left side two sub-strings and right side two sub-strings until no characters are left.
Both of them can be utilized to calculate the similarity between two strings and transform one string into another using four operations: delete, replace, copy, and insert.
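For concreteness, here is a rough sketch of the dynamic programme I mean (my own illustration, not meant to be efficient): dist[i][j] is the edit distance between the first i characters of a and the first j characters of b.
def edit_distance(a, b):
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i                  # delete all of a[:i]
    for j in range(len(b) + 1):
        dist[0][j] = j                  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                           # delete a[i-1]
                dist[i][j - 1] + 1,                           # insert b[j-1]
                dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # replace or copy
            )
    return dist[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))   # 3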
But I wonder when to use which: the edit-distance-based SequenceMatcher or the one in difflib?
Here is what I found on the internet that makes me think that this question would also benefit others:
In the documentation of edit distance it reads that
Similar to the difflib SequenceMatcher, but uses Levenshtein/edit distance.
In this answer to a question on calculating edit distance, the Ratcliff/Obershelp algorithm was suggested.
There are only a few resources about the Ratcliff/Obershelp algorithm, let alone comparisons of it to edit distance, which I thought was the most well-known string alignment algorithm.
So far, I have the following ideas:
I find that edit distance and the Ratcliff/Obershelp algorithm can both be used for spell checking. But when to use which?
I thought the edit distance is employed to find the minimal edit sequence while the Ratcliff/Obershelp algorithm yields matches that "look right" to people. However, 'look right' seems too vague a term, especially in real world applications. What's more, when is the minimum edit sequence a must/preferred?
Any suggestions would be highly appreciated, and thanks in advance.
"Looks right to people" needn't be all that vague. Search the web for discussion of why, e.g., the very widely used git source control system added "patience" and "histogram" differencing algorithms, as options. Variations of "minimal edit distance" routinely produce diffs that are jarring to humans, and I'm not going to reproduce examples here that are easily found by searching.
From a formal perspective, Levenshtein is more in line with what a mathematician means by "distance". Chiefly, difflib's .ratio() can depend on the order of the arguments passed to it, but Levenshtein is insensitive to order:
>>> import difflib
>>> difflib.SequenceMatcher(None, "tide", "diet").ratio()
0.25
>>> difflib.SequenceMatcher(None, "diet", "tide").ratio()
0.5
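For contrast, the edit distance itself comes out the same in both directions (this example uses the third-party python-Levenshtein package purely for illustration; any correct Levenshtein implementation behaves the same way):
>>> import Levenshtein
>>> Levenshtein.distance("tide", "diet")
3
>>> Levenshtein.distance("diet", "tide")
3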
For the rest, I don't think you're going to get crisp answers. There are many notions of "similarity", not just the two you mentioned, and they all have their fans. "Minimal" was probably thought to be more important back when disk space and bandwidth were scarce and expensive.
The physical realities constraining genetic mutation have made measures that take into account spatial locality much more important in that field - doesn't matter one whit if it's "minimal" if it's also physically implausible ;-) Terms to search for: Smith–Waterman, and Needleman–Wunsch.
I've run into an interesting problem, where I need to make a many-to-many hash with a minimized number of entries. I'm working in python, so that comes in the form of a dictionary, but this problem would be equally applicable in any language.
The data initially comes in as input of one key to one entry (representing one link in the many-to-many relationship).
So like:
A-1, B-1, B-2, B-3, C-2, C-3
A simple way of handling the data would be linking them one to many:
A: 1
B: 1,2,3
C: 2,3
However the number of entries is the primary computational cost for a later process, as a file will need to be generated and sent over the internet for each entry (that is a whole other story), and there would most likely be thousands of entries in the one-to-many implementation.
Thus a more optimized hash would be:
[A, B]: 1
[B, C]: 2,3
This table would be discarded after use, so maintainability is not a concern; the only concern is the time complexity of reducing the entries (the time the algorithm takes to reduce the entries must not exceed the time it would save relative to the baseline one-to-many table).
Now, I'm pretty sure that at least someone has faced this problem, this seems like a problem straight out of my Algorithms class in college. However, I'm having trouble finding applicable algorithms, as I can't find the right search terms. I'm about to take a crack at making an algorithm for this from scratch, but I figured it wouldn't hurt to ask around to see if people can't identify this as a problem commonly solved by a modified [insert well-known algorithm here].
I personally think it's best to start by creating a one-to-many hash and then examining subsets of the values in each entry, creating an entry in the solution hash for the maximum identified set of shared values. But I'm unsure how to guarantee a smaller number of subsets than just the one-to-many baseline implementation.
Let's go back to your unoptimised dictionary of letters to sets of numbers:
A: 1
B: 1,2,3
C: 2,3
There's a - in this case 2-branch - tree of refactoring steps you could do:
                     A:1  B:1,2,3  C:2,3
                    /                    \
        factor using set 2,3        factor using set 1
                  /                        \
        A:1  B:1  B,C:2,3          A,B:1  B:2,3  C:2,3
                 /                          \
        factor using set 1          factor using set 2,3
                /                            \
        A,B:1  B,C:2,3               A,B:1  B,C:2,3
In this case at least, you arrive at the same result regardless of which factoring you do first, but I'm not sure if that would always be the case.
Doing an exhaustive exploration of the tree sounds expensive, so you might want to avoid that, but if we could pick the optimal path, or at least a likely-good path, it would be relatively low computational cost. Rather than branching at random, my gut instinct is that it'd be faster and closer to optimal if you tried to make the largest-set factoring change possible at each point in the tree. For example, considering the two-branch tree above, you'd prefer the initial 2,3 factoring over the initial 1 factoring in your tree, because 2,3 has the larger set of size two. More dramatic refactorings suggest the number of refactorings before you reach a stable result will be smaller.
What that amounts to is iterating the sets from largest towards smallest (it doesn't matter which order you iterate over same-length sets), looking for refactoring opportunities.
Much like a bubble-sort, after each refactoring the approach would be "I made a change; it's not stable yet; let's repeat". Restart by iterating from second-longest sets towards shortest sets, checking for optimisation opportunities as you go.
(I'm not sure about Python, but in general set comparisons can be expensive, so you might want to maintain a value for each set that is the XOR of the hashed values in the set - that's easy and cheap to update if a few set elements are changed, and a trivial comparison can tell you large sets are unequal, saving comparison time; it won't tell you if sets are equal though: multiple sets could have the same XOR-of-hashes value.)
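If it helps, here is a rough sketch of that greedy factoring in Python (my own illustration; the pair-picking rule is just one way to realise the "largest shared set first" heuristic, and it is not guaranteed to give the minimum possible number of entries):
from itertools import combinations

def minimise(one_to_many):
    # entries map a frozenset of keys to the set of values they share;
    # repeatedly pull the largest value set shared by two entries out
    # under the merged key until nothing is shared any more
    entries = {frozenset([k]): set(vs) for k, vs in one_to_many.items()}
    while True:
        best = max(combinations(entries, 2),
                   key=lambda pair: len(entries[pair[0]] & entries[pair[1]]),
                   default=None)
        if best is None:
            break                                   # fewer than two entries left
        a, b = best
        shared = entries[a] & entries[b]
        if not shared:
            break                                   # nothing left to factor out
        entries[a] -= shared
        entries[b] -= shared
        entries.setdefault(a | b, set()).update(shared)
        entries = {k: v for k, v in entries.items() if v}   # drop emptied entries
    return entries

table = {'A': {1}, 'B': {1, 2, 3}, 'C': {2, 3}}
for keys, values in minimise(table).items():
    print(sorted(keys), sorted(values))     # ['B', 'C'] [2, 3] and ['A', 'B'] [1]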
I recently had a very, very intense debate about the runtime complexity of a super simple algorithm with a colleague of mine. In the end we both agreed to disagree but as I've been thinking about this, it's challenged my basic understanding of computer science fundamentals and so I therefore must get additional insight on the matter.
Given the following python, what is the Big-O runtime complexity:
for c in "How are you today?":
    print c
Now, I immediately called out that this is simply on the order of O(n), aka linear. Meaning it's dependent on the length of the string, so this loop's running time will grow linearly as the length of the string grows.
My colleague then said, "No, it's constant because we know that for the set of all strings we are dealing with (in our case), the max string is always 255 characters long (in our case), therefore it must be constant." He followed on by saying "because we have a max upper-bound on character length of the string this results in O(255) which reduces to O(1)."
Anyways, we went back and forth and after 45 minutes of both of us drawing sketches we both dead-locked on the issue.
My question is in what world or what math system is the loop above a constant time loop? If we knew our upper-bound was say 1,000,000 characters and the set of all strings could be anywhere from 0 to 1,000,000 this loop will obviously exhibit linear running times depending on the size of the string.
I additionally asked him if he also thinks the following code is O(1) if the upper-bound size of n is known. Meaning we are certain this code will only ever operate on a max upper-bound of say 255 characters:
s = "How are you today?"
for c in s:
    for d in s:
        print c+d
He said this is also constant time... even after I explained that this is an O(n^2) algorithm and demonstrated that the code above produces a quadratic curve.
So, am I missing some theoretical concept where any of the above is true depending on how the theory goes? Just to be clear his understanding is that I am correct if n is not known. If the upper-bound of n is always known he is asserting that the two algorithms on this post are both of constant runtime complexity.
Just looking to maintain my sanity, but perhaps if I'm wrong there's certainly some additional learning I can benefit from. My good, good colleague was very convincing. Also, if anybody has additional links or material on the subject specific to this question please add to the comments.
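For reference, this is roughly how I produced that curve (an illustrative timing sketch of my own; I replaced the prints with list appends so that I/O doesn't dominate the measurement):
import timeit

def nested(s):
    out = []
    for c in s:
        for d in s:
            out.append(c + d)    # stands in for the print in the code above
    return out

for n in (100, 200, 400, 800):
    s = "x" * n
    print(n, timeit.timeit(lambda: nested(s), number=10))
# the measured time roughly quadruples each time the length doubles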
Applying Big-O notation to a single scenario in which all the inputs are known is ludicrous. There is no Big-O for a single case.
The whole point is to get a worst-case estimate for arbitrarily large, unknown values of n. If you already know the exact answer, why on Earth would you waste time trying to estimate it?
Mathy / Computer-Sciencey Edit:
Big-O notation is defined as n grows arbitrarily large: f(n) is O(g(n)) if c * g(n) ≥ f(n) for some constant c, for all n greater than some nMin. When g grows strictly faster than f, even more is true: g(n) eventually exceeds c * f(n) for any constant c. Meaning, your "opponent" can set c to "eleventy-quadjillion" and it doesn't matter, because, for all points "to the right" of some point nMin, the graph of "eleventy-quadjillion times f(n)" will lag below g(n)... forever.
Example: 2^n is less than or equal to n^2... for a short segment of the x-axis that includes n = 2, 3, and 4 (at n = 3, 2^n is 8, while n^2 is 9). This doesn't change the fact that their Big-O relationship is the opposite: O(2^n) is much greater than O(n^2), because Big-O says nothing about n values less than nMin. If you set nMin to 4 (thus ignoring the graph to the left of 4), you'll see that the n^2 line never exceeds the 2^n line.
If your "opponent" multiplies n^2 by some larger constant c to raise "his" n^2 line above your 2^n line, you haven't lost yet... you just slide nMin to the right a bit. Big-O says that no matter how big he makes c, you can always find a point after which his equation loses and yours wins, forever.
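(Purely illustrative, with numbers of my own choosing: a quick brute-force check of where a huge constant stops helping, using 10**6 as a stand-in for eleventy-quadjillion.)
c = 10 ** 6
n = 1
while 2 ** n <= c * n * n:    # while c * n^2 is still winning
    n += 1
print(n)                      # 30: from here on, 2^n stays above 10**6 * n^2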
But, if you constrain n on the right, you've violated the prerequisites for any kind of Big-O analysis. In your argument with your co-worker, one of you invented an nMax, and then the other set nMin somewhere to the right of it --- surprise, the results are nonsensical.
For instance, the first algorithm you showed does indeed do about n work for inputs of length n... in the general case. If I were building my own algorithm that called it n times, I would have to consider mine a quadratic O(n^2) algorithm... again, in the general case.
But if I could prove that I would never call your algorithm with an input greater than say 10 (meaning I had more information, and could thus estimate my algorithm more precisely), using Big-O to estimate your algorithm's performance would be throwing away what I'd learned about its actual behavior in the case I care about. I should instead replace your algorithm with a suitably large constant --- changing my algorithm from c * n^2 to c * 10 * n... which is just cBigger * n. I could honestly claim my algorithm is linear, because in this case, your algorithm's graph will never rise above that constant value. This would change nothing about the Big-O performance of your algorithm, because Big-O is not defined for constrained cases like this.
To wrap up: In general, that first algorithm you showed was linear by Big-O standards. In a constrained case, where the maximum input is known, it is a mistake to speak of it in Big-O terms at all. In a constrained case, it could legitimately be replaced by some constant value when discussing the Big-O behavior of some other algorithm, but that says absolutely nothing about the Big-O behavior of the first algorithm.
In conclusion: O(Ackermann(n)) works fine when nMax is small enough. Very, very small enough...
In your case...
I am tempted to say that your friend is softly wrong, and that's because of the considerable constant of 256 hidden in that O(1) run time. Your friend said that the execution was O(256). And because we ignore the constants in Big-O, we simply write O(256 * 1) as O(1). It is up to you to decide whether this constant is negligible for you or not.
I have two strong reasons to say that you are right:
Firstly, for various values of n, your answer of O(n) (in first code) gives a better approximation of the running-time. For example:
For a string of length 4: you say the run-time is proportional to 4, while your friend says it is proportional to 1 (or 256).
For a string of length 255: you say the running time is proportional to 255, while your friend again says that it is constant time.
Clearly, your answer is more accurate in every case, even though his answer is not outright wrong.
Secondly, if you go by your friend's method, then in one sense you can cheat and say that since no string can go beyond your RAM + disk size, therefore all processing is in O(1). And that's when the fallacy of your friend's reasoning becomes visible. Yes he is right that running time (assuming 1TB hard disk and 8 GB RAM) is O((1TB + 8GB) *1) = O(1), but you simply cannot ignore the size of your constant in this scenario.
The Big-O complexity does not tell the actual time of execution, but just the simplistic rate of growth of the running time as the value of n increases.
I think you're both right.
The runtime of the first algorithm is linear in the size of its input. However, if its input is fixed, then its runtime is also fixed.
Big O is all about measuring the behavior of an algorithm as its input changes. If the input never changes, then Big O is meaningless.
Also: O(n) means that the upper bound of the complexity is n. If you want to represent a tight bound then the more precise notation is Θ(n) (theta notation).
You're both right in a way, but you're more right than your colleague. (EDIT: Nope. On further thought, you're right and your colleague is wrong. See my comment below.) The question really isn't whether N is known, but whether N can change. Is s the input to your algorithm? Then it's O(N) or O(N^2): you know the value of N for this particular input, but a different input would have a different value, so knowing N for this input isn't relevant.
Here's the difference in your two approaches. You're treating this code as if it looked like this:
def f(s):
    for c in s:
        print c
f("How are you today?")
But your colleague is treating it like this:
def f(some_other_input):
    for c in "How are you today?":
        print c
f("A different string")
In the latter case, that for loop should be considered O(1), because it's not going to change with different inputs. In the former case, the algorithm is O(N).
I'm attempting to write a program which finds the 'pits' in a list of integers.
A pit is any integer x where x is less than or equal to the integers immediately preceding and following it. If the integer is at the start or end of the list it is only compared on the inward side.
For example in:
[2,1,3] 1 is a pit.
[1,1,1] all elements are pits.
[4,3,4,3,4] the elements at 1 and 3 are pits.
I know how to work this out by taking a linear approach and walking along the list, however I am curious about how to apply divide and conquer techniques to do this comparatively quickly. I am quite inexperienced and am not really sure where to start; I feel like something similar to a binary tree could be applied?
If it's pertinent, I'm working in Python 3.
Thanks for your time :).
Without any additional information on the distribution of the values in the list, it is not possible to achieve any algorithmic complexity of less than O(x), where x is the number of elements in the list.
Logically, if the dataset is random, such as Brownian noise, a pit can happen anywhere, requiring a full 1:1 sampling of the list in order to correctly find every pit.
Even if one just wants to find the absolute lowest pit in the sequence, that would not be possible to achieve in sub-linear time without repercussions on the correctness of the results.
Optimizations can be considered, such as mere parallelization or skipping values neighboring a pit, but the overall complexity would stay the same.
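For reference, a minimal linear-scan sketch (my own illustration) that matches the definition in the question and visits each element once:
def find_pits(xs):
    pits = []
    for i, x in enumerate(xs):
        left_ok = i == 0 or x <= xs[i - 1]              # at the ends, compare inward only
        right_ok = i == len(xs) - 1 or x <= xs[i + 1]
        if left_ok and right_ok:
            pits.append(i)
    return pits

print(find_pits([2, 1, 3]))        # [1]
print(find_pits([1, 1, 1]))        # [0, 1, 2]
print(find_pits([4, 3, 4, 3, 4]))  # [1, 3]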