I want to determine the largest contiguous (if that’s the right word) graph consisting of a bunch of sub-graphs. I define two sub-graphs as contiguous if any node in one sub-graph is linked to any node in the other.
My initial solution to this is very slow and cumbersome and stupid – just to look at each sub-graph, see if it’s linked to any of the other sub-graphs, and do the analysis for all of the sub-graphs to find the largest number of linked sub-graphs. That’s just me coming from a Fortran background. Is there a better way to do it – a pythonic way, even a graph theory way? I imagine this is a standard question in network science.
A good starting point to answer the kind of question you've asked is to look at a merge-find (or disjoint-set) approach (https://en.wikipedia.org/wiki/Disjoint-set_data_structure).
It offers an efficient algorithm (at least in an amortized sense) to identify which members of a collection of graphs are disjoint and which aren't.
Here are a couple of related questions that have pointers to additional resources about this algorithm (also known as "union-find"):
Union find implementation using Python
A set union find algorithm
You can get quite respectable performance by merging two sets using "union by rank" as summarized in the Wikipedia page (and the pseudocode provided therein):
For union by rank, a node stores its rank, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots x and y, first compare their ranks. If the ranks are different, then the larger rank tree becomes the parent, and the ranks of x and y do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a Find operation, so storing ranks avoids the extra effort of keeping the height correct.
I believe there may be even more sophisticated approaches, but the above union-by-rank implementation is what I have used in the past.
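Here is a minimal Python sketch of a disjoint-set structure with union by rank (and a simple form of path compression in find). The class and variable names are just illustrative, not from any particular library:

from collections import Counter

class DisjointSet:
    def __init__(self, elements):
        # every element starts as its own parent, with rank 0
        self.parent = {e: e for e in elements}
        self.rank = {e: 0 for e in elements}

    def find(self, x):
        # walk up to the root, halving the path as we go
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # union by rank: attach the lower-rank root under the higher-rank one
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

# toy example: sub-graphs {1,2,3} and {4,5} joined by the edge (3, 4); node 6 is isolated
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
ds = DisjointSet(nodes)
for a, b in edges:
    ds.union(a, b)
sizes = Counter(ds.find(n) for n in nodes)
print(max(sizes.values()))   # 5: nodes 1-5 form the largest connected group

To answer the original question, you would union the endpoints of every edge (including edges that run between sub-graphs) and then count how many nodes share each root.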
I would like to understand Timsort. Including all details, so an answer like "just call .sort() in Java or in Python" is not what I'm looking for.
I've already read the Wikipedia article, the Java implementation, the CPython implementation, the listsort.txt that these sources mention, and the publication listsort.txt references: Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs. I've also skimmed through some of the references of Nearly-Optimal Mergesorts, namely Scaling and Related Techniques for Geometry Problems, A Unifying Look at Data Structures, and Fast Algorithms for Finding Nearest Common Ancestors; I wouldn't say I understood the latter three in full depth but I'm quite sure that they don't answer my question.
I grok most of the sub-algorithms: run counting, reversing runs, binary insertion sort, merging (saving the head / tail part), galloping. Please do not try to explain these; I know what they do, how, and why.
What I miss is the merge pattern. Actually, I understand how it works by maintaining a stack of runs and deciding when to merge based on repeatedly checking invariants; I also see that the goals are: adapting to natural runs, achieving sorting stability, balancing merges, exploiting temporal locality and keeping the algorithm simple.
But I don't get why reestablishing these invariants results in an efficient algorithm.
The file listsort.txt states that (line 346):
The code now uses the "powersort" merge strategy from: "Nearly-Optimal
Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to
Existing Runs" J. Ian Munro and Sebastian Wild
I understand how Munro and Wild's powersort works, and I find their explanation involving "nearly-optimal alphabetic trees" with the "bisection heuristic" and "movement along the edges of Cartesian trees" sufficient.
What I can't grasp is the link between powersort's and Timsort's merge pattern.
Clearly, they are different:
powersort considers node powers, while Timsort takes into account runlengths,
powersort has one invariant: B <= A, while Timsort has two: Y > X and Z > Y + X (where X is the most recently read run, which has the highest starting index in the array and is on the top or will be placed on the top of the stack, and A is the node between runs Y and X),
powersort always merges Y and X, while Timsort merges either Y and X or Z and Y.
I see that a higher node power usually sits between shorter runs, but I cannot deduce one algorithm from the other. Please explain the connection between the two.
Once I understand this, I hope I will also find the justification of the stack capacities:
/*
* Allocate runs-to-be-merged stack (which cannot be expanded). The
* stack length requirements are described in listsort.txt. The C
* version always uses the same stack length (85), but this was
* measured to be too expensive when sorting "mid-sized" arrays (e.g.,
* 100 elements) in Java. Therefore, we use smaller (but sufficiently
* large) stack lengths for smaller arrays. The "magic numbers" in the
* computation below must be changed if MIN_MERGE is decreased. See
* the MIN_MERGE declaration above for more information.
*/
int stackLen = (len < 120    ?  5 :
                len < 1542   ? 10 :
                len < 119151 ? 19 : 40);
How were these numbers calculated? Why is it certain that the stack won't grow beyond these sizes? Why doesn't the code check whether there is a free slot? If there isn't, then either (a) expanding the stack or (b) merging at the top would easily prevent an ArrayIndexOutOfBoundsException, but I didn't find anything like that in the code.
Please explain the connection between the two.
They instantiate a common framework: split the input list into runs, merge adjacent runs until there is only one. The different ways to do this can be summarized by binary trees whose leaves are the runs. Timsort, powersort, and peeksort (from the same paper as powersort) represent three different tree construction methods.
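For intuition, here is a minimal Python sketch of that shared framework only (it is neither TimSort nor powersort, and all function names are mine): split the input into non-decreasing runs, then repeatedly merge adjacent runs until one remains. The merge policy at the end is a naive placeholder; choosing which adjacent pair to merge is exactly where the two algorithms differ.

def split_into_runs(a):
    # cut the list into maximal non-decreasing runs
    runs, start = [], 0
    for i in range(1, len(a) + 1):
        if i == len(a) or a[i] < a[i - 1]:
            runs.append(a[start:i])
            start = i
    return runs

def merge(left, right):
    # standard stable two-way merge (`<=` keeps equal elements in order)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def framework_sort(a):
    runs = split_into_runs(a)
    while len(runs) > 1:
        # policy choice: naively merge the two leftmost runs;
        # TimSort and powersort differ only in which adjacent pair they pick
        runs[0:2] = [merge(runs[0], runs[1])]
    return runs[0] if runs else []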
How were these numbers calculated?
Wrongly. (In case the link dies, that's de Gouw, Rot, de Boer, Bubel, and Hähnle: "OpenJDK's java.utils.Collection.sort() is broken: The good, the bad and the worst case".)
I have a rather simple problem to define but I did not find a simple answer so far.
I have two graphs (i.e. sets of vertices and edges) which are identical. Each of them has independently labelled vertices. Look at the example below:
How can the computer detect, without prior knowledge of it, that 1 is identical to 9, 2 to 10 and so on?
Note that in the case of symmetry, there may be several possible one-to-one pairings which give complete equivalence, but just finding one of them is sufficient for me.
This is in the context of a Python implementation. Does someone have a pointer towards a simple algorithm publicly available on the Internet? The problem sounds simple, but I simply lack the mathematical knowledge to come up with it myself or to find the proper keywords to find the information.
EDIT: Note that I also have atom types (i.e. labels) for each graph, as well as the full distance matrix for the two graphs to align. However, the positions may be similar but not exactly equal.
This is known as the graph isomorphism problem, and it is probably quite hard, although the exact details of how hard are still a subject of research.
(But things look better if your graphs are planar.)
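If an off-the-shelf solution is acceptable, NetworkX ships a VF2-based isomorphism matcher that also returns a vertex mapping. A minimal sketch (the graphs and the 'atom_type' attribute are illustrative assumptions about how your data is stored):

import networkx as nx
from networkx.algorithms import isomorphism

G1 = nx.Graph()
G1.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 1)])
G2 = nx.Graph()
G2.add_edges_from([(9, 10), (10, 11), (11, 12), (12, 9)])

gm = isomorphism.GraphMatcher(G1, G2)
if gm.is_isomorphic():
    print(gm.mapping)   # e.g. {1: 9, 2: 10, 3: 11, 4: 12}

# if vertices carry labels (e.g. atom types in an 'atom_type' attribute),
# pass a node_match so only like-typed vertices are paired:
#   nm = isomorphism.categorical_node_match('atom_type', None)
#   gm = isomorphism.GraphMatcher(G1, G2, node_match=nm)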
So, after searching for it a bit, I think I found a solution that works most of the time for moderate computational cost. It is a kind of genetic algorithm which uses a bit of randomness, but it seems practical enough for my purposes. I haven't hit any aberrant configuration with my samples so far, even though it is theoretically possible.
Here is how I proceeded:
Determine the complete set of 2-paths, 3-paths and 4-paths
Determine vertex types using both atom type and surrounding topology, creating an "identity card" for each vertex
Do the following ten times:
Start with a random candidate set of pairings complying with the allowed vertex types
Evaluate how many of the 2-paths, 3-paths and 4-paths correspond between the two graphs under the current pairing, scoring one point for each corresponding vertex (also using the atom type as an additional descriptor); see the sketch after this list
Evaluate all other shortlisted candidates for a given vertex by permuting the pairings for this candidate with its other positions in the same way
Sort the scores in descending order
For each score, check if the configuration is among the excluded configurations, and if it is not, take it as the new configuration and put it into the excluded configurations.
If the score is perfect (i.e. all of the 2-paths, 3-paths and 4-paths correspond), then stop the loop and calculate the sum of absolute differences between the distance matrices of the two graphs to pair using the selected pairing, otherwise go back to 4.
Stop this process after it has been done 10 times
Check the difference between distance matrices and take the pairings associated with the minimal sum of absolute differences between the distance matrices.
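A rough sketch of the path enumeration and scoring pieces of the procedure above, assuming the graphs are networkx.Graph objects whose nodes carry an 'atom_type' attribute; the function names are purely illustrative, and the scoring here is one point per matching path, a close cousin of the per-vertex scoring described above:

def k_paths(G, k):
    # all simple paths with exactly k vertices, as ordered tuples of nodes
    paths = set()
    def extend(path):
        if len(path) == k:
            paths.add(tuple(path))
            return
        for nbr in G[path[-1]]:
            if nbr not in path:
                extend(path + [nbr])
    for v in G:
        extend([v])
    return paths

def score_pairing(G1, G2, pairing):
    # count k-paths of G1 whose image under the pairing is also a k-path
    # of G2 with matching atom types
    score = 0
    for k in (2, 3, 4):
        paths2 = k_paths(G2, k)
        for p in k_paths(G1, k):
            image = tuple(pairing[v] for v in p)
            if image in paths2 and all(
                    G1.nodes[a]['atom_type'] == G2.nodes[b]['atom_type']
                    for a, b in zip(p, image)):
                score += 1
    return score

# usage (G1, G2 are networkx Graphs; `pairing` maps G1 vertices to G2 vertices):
#   score = score_pairing(G1, G2, pairing)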
I've run into an interesting problem, where I need to make a many-to-many hash with a minimized number of entries. I'm working in python, so that comes in the form of a dictionary, but this problem would be equally applicable in any language.
The data initially comes in as input of one key to one entry (representing one link in the many-to-many relationship).
So like:
A-1, B-1, B-2, B-3, C-2, C-3
A simple way of handling the data would be linking them one to many:
A: 1
B: 1,2,3
C: 2,3
However the number of entries is the primary computational cost for a later process, as a file will need to be generated and sent over the internet for each entry (that is a whole other story), and there would most likely be thousands of entries in the one-to-many implementation.
Thus a more optimized hash would be:
[A, B]: 1
[B, C]: 2,3
This table would be discarded after use, so maintainability is not a concern; the only concern is the time complexity of reducing the entries (the time the algorithm takes to reduce the entries must not exceed the time it would save relative to the baseline one-to-many table).
Now, I'm pretty sure that at least someone has faced this problem; it seems like a problem straight out of my Algorithms class in college. However, I'm having trouble finding applicable algorithms, as I can't find the right search terms. I'm about to take a crack at making an algorithm for this from scratch, but I figured it wouldn't hurt to ask around to see if anyone can identify this as a problem commonly solved by a modified [insert well-known algorithm here].
I personally think it's best to start by creating a one-to-many hash and then examining subsets of the values in each entry, creating an entry in the solution hash for the maximum identified set of shared values. But I'm unsure how to guarantee a smaller number of subsets than just the one-to-many baseline implementation.
Let's go back to your unoptimised dictionary of letters to sets of numbers:
A: 1
B: 1,2,3
C: 2,3
There's a - in this case 2-branch - tree of refactoring steps you could do:
                    A:1  B:1,2,3  C:2,3
                   /                    \
      factor using set 2,3         factor using set 1
                 /                          \
       A:1  B:1  B,C:2,3            A,B:1  B:2,3  C:2,3
               /                              \
      factor using set 1             factor using set 2,3
             /                                  \
      A,B:1  B,C:2,3                     A,B:1  B,C:2,3
In this case at least, you arrive at the same result regardless of which factoring you do first, but I'm not sure if that would always be the case.
Doing an exhaustive exploration of the tree sounds expensive, so you might want to avoid that; but if we could pick the optimal path, or at least a likely-good path, it would be relatively cheap. Rather than branching at random, my gut instinct is that it would be faster and closer to optimal to make the largest-set factoring change possible at each point in the tree. For example, considering the two-branch tree above, you'd prefer the initial 2,3 factoring over the initial 1 factoring, because 2,3 is the larger set (size two). More dramatic refactorings suggest the number of steps before you reach a stable result will be smaller.
What that amounts to is iterating the sets from largest towards smallest (it doesn't matter which order you iterate over same-length sets), looking for refactoring opportunities.
Much like a bubble-sort, after each refactoring the approach would be "I made a change; it's not stable yet; let's repeat". Restart by iterating from second-longest sets towards shortest sets, checking for optimisation opportunities as you go.
(I'm not sure about Python specifically, but in general set comparisons can be expensive, so you might want to maintain, for each set, the XOR of the hashed values of its elements. That's easy and cheap to update when a few set elements change, and a trivial comparison can tell you two large sets are unequal, saving comparison time. It won't tell you whether sets are equal, though: multiple sets could have the same XOR-of-hashes value.)
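For comparison, here's a different route to the same factored form on this example. It is not the iterative factoring described above: it inverts the mapping and groups values by the exact set of keys that reference them, in one pass. The function name is just illustrative:

from collections import defaultdict

def factor_by_key_sets(mapping):
    # mapping: key -> set of values, e.g. {'A': {1}, 'B': {1, 2, 3}, 'C': {2, 3}}
    keys_per_value = defaultdict(set)
    for key, values in mapping.items():
        for v in values:
            keys_per_value[v].add(key)
    grouped = defaultdict(set)
    for v, keys in keys_per_value.items():
        grouped[frozenset(keys)].add(v)
    return dict(grouped)

# factor_by_key_sets({'A': {1}, 'B': {1, 2, 3}, 'C': {2, 3}})
# -> {frozenset({'A', 'B'}): {1}, frozenset({'B', 'C'}): {2, 3}}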
I am new to Sage and I have been reading the documentation, but this is very new territory for me and it is a bit tough to get a good grasp on everything.
What I want to do is, given an adjacency matrix, an upper bound, and a lower bound, generate all pathways through that matrix, where a "pathway" consists of one entry from each row, such that the weight of the pathway is between the bounds (inclusive).
Even better would be if I could organize the given pathways by 1) the number of edges in each row with a lower weight than the entry in the pathway, and/or 2) minimum overlap with other pathways with regard to #1.
For clarity, a quick example.
Given the 4x4 matrix:
[[1,2,3,4],
[5,6,7,8],
[10,11,12,13],
[20,21,22,23]]
And an upper bound of 38 and a lower bound of 37, possible pathways could be:
2,5,10,20
3,5,10,20
2,6,10,20
2,5,11,20
2,5,10,21
etc etc. I don't want to write out all the pathways, so hopefully you get the idea.
Even better would be if I could quickly filter out redundancy by not including pathways that are subsets of other pathways (for example, 2,5,10,20 is encompassed by 3,5,10,20, since for each pathway I plan on including all lower-weight edges of each respective row).
If you have a symmetric matrix (with or without diagonal entries nonzero) you can do this.
M = matrix([[0,2,3,4],[2,0,7,8], [3,7,0,13], [4,8,13,0]])
G = Graph(M,format='weighted_adjacency_matrix')
G.graphplot(edge_labels=True,spring=True).show()
And hopefully from G itself you could do what you want. (Unless it has nothing to do with graphs and only the matrix, in which case you may want a different thing entirely.)
I'm not sure exactly what you are trying to do (how do the "pathways" correspond to subgraphs?), and probably describing that would be beyond the scope of this site (as opposed to math.SX.com), but the generic graph documentation has some path methods and the undirected documentation may also prove helpful.
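If the graph interpretation turns out not to be what's needed, a plain brute-force enumeration over the rows (ordinary Python, not Sage-specific) illustrates the "one entry per row, weight within bounds" idea on the 4x4 example from the question; fine for small matrices, though the number of combinations grows quickly:

from itertools import product

def pathways(matrix, lower, upper):
    # yield one-entry-per-row selections whose total weight is within bounds
    for combo in product(*matrix):
        if lower <= sum(combo) <= upper:
            yield combo

M = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [10, 11, 12, 13],
     [20, 21, 22, 23]]

for p in pathways(M, 37, 38):
    print(p)   # (2, 5, 10, 20), (3, 5, 10, 20), (2, 6, 10, 20), ...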
I have a tree. It has a flat bottom. We're only interested in the bottom-most leaves, but this is roughly how many leaves there are at the bottom...
2 x 1600 x 1600 x 10 x 4 x 1600 x 10 x 4
That's ~13,107,200,000,000 leaves? Because of the size (the calculation performed on each leaf seems unlikely to be optimised to ever take less than one second) I've given up thinking it will be possible to visit every leaf.
So I'm thinking I'll build a 'smart' leaf crawler which inspects the most "likely" nodes first (based on results from the ones around it). So it's reasonable to expect the leaves to be evaluated in branches/groups of neighbours, but the groups will vary in size and distribution.
What's the smartest way to record which leaves have been visited and which have not?
You don't give a lot of information, but I would suggest tuning your search algorithm to help you keep track of what it's seen. If you had a global way of ranking leaves by "likelihood", you wouldn't have a problem since you could just visit leaves in descending order of likelihood. But if I understand you correctly, you're just doing a sort of hill climbing, right? You can reduce storage requirements by searching complete subtrees (e.g., all 1600 x 10 x 4 leaves in a cluster that was chosen as "likely"), and keeping track of clusters rather than individual leaves.
It sounds like your tree geometry is consistent, so depending on how your search works, it should be easy to merge your nodes upwards... e.g., keep track of level 1 nodes whose leaves have all been examined, and when all children of a level 2 node are in your list, drop the children and keep their parent. This might also be a good way to choose what to examine: If three children of a level 3 node have been examined, the fourth and last one is probably worth examining too.
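A tiny sketch of that "merge fully-examined children into their parent" bookkeeping, under the assumption that each node is represented as a tuple of child indices (its path from the root) and that you can ask how many children a node has; mark_examined and num_children are illustrative names, not anything from a library:

def mark_examined(done, node, num_children):
    # record `node` as fully examined, then collapse upwards: whenever all
    # siblings of a node are done, replace them in `done` by their parent
    done.add(node)
    parent = node[:-1]              # nodes are path tuples, e.g. (0, 917, 3)
    while parent:
        siblings = {parent + (i,) for i in range(num_children(parent))}
        if siblings <= done:
            done -= siblings
            done.add(parent)
            parent = parent[:-1]
        else:
            break

# example with the level sizes quoted above:
#   branching = (2, 1600, 1600, 10, 4, 1600, 10, 4)
#   num_children = lambda prefix: branching[len(prefix)]
#   done = set()
#   mark_examined(done, (0, 5, 3, 2, 1, 7, 9, 0), num_children)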
Finally, a thought: Are you really, really sure that there's no way to exclude some solutions in groups (without examining every individual one)? Problems like sudoku have an astronomically large search space, but a good brute-force solver eliminates large blocks of possibilities without examining every possible 9 x 9 board. Given the scale of your problem, this would be the most practical way to attack it.
It seems that you're looking for a quick and memory-efficient way to do a membership test. If so, and if you can cope with some false positives, go for a Bloom filter.
Bottom line: use Bloom filters in situations where your data set is really big AND all you need is to check whether a particular element exists in the set AND a small chance of false positives is tolerable.
Some implementations for Python should exist.
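If you want to see what's inside such a filter, here is a minimal pure-Python sketch (illustrative only; the class name and parameters are mine, and packages on PyPI will be faster and better tuned):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=10_000_000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # derive num_hashes bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

# visited = BloomFilter()
# visited.add(leaf_id)
# if leaf_id in visited: ...   # may rarely be a false positive, never a false negative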
Hope this will help.
Maybe this is too obvious, but you could store your results in a similar tree. Since your computation is slow, the results tree should not grow out of hand too quickly. Then just look up if you have results for a given node.