I have a problem that is rather simple to define, but so far I have not found a simple answer. I have two graphs (i.e. sets of vertices and edges) which are identical, but whose vertices are labelled independently. Look at the example below:
How can the computer detect, without prior knowledge of it, that 1 is identical to 9, 2 to 10 and so on?
Note that in the case of symmetry there may be several possible one-to-one pairings which give complete equivalence, but finding just one of them is sufficient for me.
This is in the context of a Python implementation. Does someone have a pointer to a simple algorithm publicly available on the Internet? The problem sounds simple, but I lack the mathematical knowledge to come up with a solution myself, or even the proper keywords to search for one.
EDIT: Note that I also have atom types (i.e. labels) for each graph, as well as the full distance matrix for the two graphs to align. However, the positions may be similar but not exactly equal.
This is known as the graph isomorphism problem, and it is probably very hard, although exactly how hard is still a subject of research.
(But things look better if your graphs are planar.)
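If you want a ready-made route in Python, NetworkX implements the VF2 matching algorithm, which can also take vertex labels into account. A minimal sketch, assuming each vertex stores its atom type in an "atom" attribute (the attribute name and the tiny example graphs are made up):

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Two identical graphs with independently labelled vertices.
G1 = nx.Graph([(1, 2), (2, 3), (3, 4)])
nx.set_node_attributes(G1, {1: "C", 2: "C", 3: "O", 4: "H"}, "atom")

G2 = nx.Graph([(9, 10), (10, 11), (11, 12)])
nx.set_node_attributes(G2, {9: "C", 10: "C", 11: "O", 12: "H"}, "atom")

# Only pair vertices whose atom types agree.
node_match = isomorphism.categorical_node_match("atom", None)
gm = isomorphism.GraphMatcher(G1, G2, node_match=node_match)
if gm.is_isomorphic():
    print(gm.mapping)  # one valid pairing, e.g. {1: 9, 2: 10, 3: 11, 4: 12}
```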
So, after searching for it a bit, I think I found a solution that works most of the time for a moderate computational cost. It is a kind of genetic algorithm which uses a bit of randomness, but it seems practical enough for my purposes. I haven't seen an aberrant configuration with my samples so far, even though it is theoretically possible that this happens.
Here is how I proceeded:
1. Determine the complete set of 2-paths, 3-paths and 4-paths.
2. Determine vertex types using both atom type and surrounding topology, creating an "identity card" for each vertex (see the sketch after this list).
3. Do the following ten times:
4. Start with a random candidate set of pairings complying with the allowed vertex types.
5. Evaluate how many of the 2-paths, 3-paths and 4-paths correspond between the two graphs under the current pairing, scoring one point for each corresponding vertex (also using the atom type as an additional descriptor).
6. Evaluate all other shortlisted candidates for a given vertex by permuting the pairings for this candidate with its other positions in the same way.
7. Sort the scores in descending order.
8. For each score, check whether its configuration is among the excluded configurations; if it is not, take it as the new configuration and add it to the excluded configurations.
9. If the score is perfect (i.e. all of the 2-paths, 3-paths and 4-paths correspond), stop the loop and compute the sum of absolute differences between the distance matrices of the two graphs under the selected pairing; otherwise go back to step 4.
10. Stop this process after it has run ten times.
11. Compare the distance matrices and take the pairing associated with the minimal sum of absolute differences.
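To illustrate step 2, here is a minimal sketch of what such an "identity card" could look like. The particular invariant shown (atom type plus a sorted summary of the neighbourhood) is an illustrative choice, not necessarily the exact descriptor used above:

```python
import networkx as nx

def identity_card(G, v):
    """An invariant of a vertex: its atom type plus a sorted summary of
    its neighbourhood. Vertices with different cards can never be paired."""
    neighbours = sorted(
        (G.nodes[u]["atom"], G.degree(u)) for u in G.neighbors(v)
    )
    return (G.nodes[v]["atom"], G.degree(v), tuple(neighbours))

# Candidate pairings in step 4 then only match vertices with identical cards:
# candidates[v] = [w for w in G2 if identity_card(G2, w) == identity_card(G1, v)]
```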
I want to determine the largest contiguous (if that's the right word) graph consisting of a bunch of sub-graphs. I define two sub-graphs as contiguous if any of the nodes of one are linked to any of the nodes of the other.
My initial solution to this is very slow, cumbersome and stupid: just look at each sub-graph, see if it's linked to any of the other sub-graphs, and do that analysis for all of the sub-graphs to find the largest number of linked sub-graphs. That's just me coming from a Fortran background. Is there a better way to do it: a Pythonic way, even a graph-theory way? I imagine this is a standard question in network science.
A good starting point for the kind of question you've asked is the merge-find (or disjoint-set) approach (https://en.wikipedia.org/wiki/Disjoint-set_data_structure).
It offers an efficient algorithm (at least on an amortized basis) for identifying which members of a collection of graphs are disjoint and which aren't.
Here are a couple of related questions that have pointers to additional resources about this algorithm (also known as "union-find"):
Union find implementation using Python
A set union find algorithm
You can get quite respectable performance by merging two sets using "union by rank" as summarized in the Wikipedia page (and the pseudocode provided therein):
For union by rank, a node stores its rank, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots x and y, first compare their ranks. If the ranks are different, then the larger rank tree becomes the parent, and the ranks of x and y do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a Find operation, so storing ranks avoids the extra effort of keeping the height correct.
I believe there may be even more sophisticated approaches, but the above union-by-rank implementation is what I have used in the past.
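For reference, here is a minimal Python sketch of union-find with union by rank (plus path compression in find, the usual companion optimization that the quoted passage alludes to). The class and variable names are mine, not from any particular library:

```python
from collections import Counter

class DisjointSet:
    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.rank = {x: 0 for x in items}

    def find(self, x):
        # Path compression: point x (and its ancestors) directly at the root.
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

# Usage for the sub-graph question: union the endpoints of every edge,
# then the largest contiguous component is the biggest group sharing a root.
edges = [(0, 1), (1, 2), (3, 4)]          # assumed example input
ds = DisjointSet(range(5))
for u, v in edges:
    ds.union(u, v)

root, size = Counter(ds.find(x) for x in range(5)).most_common(1)[0]
print(size)  # 3 -> the component {0, 1, 2}
```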
I was given a problem in which you are supposed to write Python code that distributes a number of different weights among 4 boxes.
Logically, we can't expect a perfect distribution: if we are given weights like 10, 65, 30, 40, 50 and 60 kilograms, there is no way of grouping those numbers without making one box heavier than another. But we can aim for the most homogeneous distribution, e.g. ((60), (40, 30), (65), (50, 10)).
I can't even think of an algorithm for this task, let alone turn it into Python code. Any ideas about the subject would be appreciated.
The problem you're describing is similar to the "fair teams" problem, so I'd suggest looking there first.
Because a simple greedy algorithm that always adds the next weight to the lightest box won't work, the most straightforward solution is a brute-force recursive backtracking algorithm that keeps track of the best solution found while iterating over all possible combinations.
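Here is a sketch of such a backtracking search. It minimizes the spread between the heaviest and the lightest box, which is one reasonable reading of "most homogeneous" (the objective function is an assumption on my part):

```python
def distribute(weights, n_boxes=4):
    """Brute-force backtracking over all assignments of weights to boxes,
    keeping the one with the smallest (max - min) difference in box totals."""
    totals = [0] * n_boxes
    boxes = [[] for _ in range(n_boxes)]
    best = {"spread": float("inf"), "boxes": None}

    def backtrack(i):
        if i == len(weights):
            spread = max(totals) - min(totals)
            if spread < best["spread"]:
                best["spread"] = spread
                best["boxes"] = [list(b) for b in boxes]
            return
        seen = set()
        for b in range(n_boxes):
            if totals[b] in seen:   # boxes with equal totals are interchangeable
                continue            # (symmetry pruning; spread depends on totals only)
            seen.add(totals[b])
            totals[b] += weights[i]
            boxes[b].append(weights[i])
            backtrack(i + 1)
            boxes[b].pop()
            totals[b] -= weights[i]

    backtrack(0)
    return best["boxes"]

print(distribute([10, 65, 30, 40, 50, 60]))
# prints a grouping with minimal spread, e.g. [[65], [60], [50, 10], [40, 30]]
```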
As stated in j_random_hacker's response, this is not going to be easy. My best idea right now is to find some baseline: the object with the largest value, since it cannot be subdivided. Using that, you can start trying to match the rest of the data to that value, which takes about three passes: the first two create a list of every possible combination, and the third goes over that list and compares the options, taking the average of each group and keeping the one whose average is closest to your baseline.
Using your example, 65 is the baseline; since you cannot subdivide it, you know it has to be the minimum bound on your data grouping, so you try to match all of the remaining values to it. It won't be great, but it does give you something to start with.
As j_random_hacker notes, the partition problem is NP-complete. This problem is also NP-complete by a reduction from the 4-partition problem (the article also contains a link to a paper by Garey and Johnson that proves that 4-partition itself is NP-complete).
In particular, given a list to 4-partition, you could feed that list as input to a function that solves your box-distribution problem. If each box ended up with the same weight, a 4-partition exists; otherwise it doesn't.
Your best bet is an exponential-time algorithm that uses backtracking to iterate over the 4^n possible assignments, because unless P = NP (highly unlikely), no polynomial-time algorithm exists for this problem.
The Problem
I've been doing a bit of research on Particle Swarm Optimization, so I said I'd put it to the test.
The problem I'm trying to solve is the Balanced Partition Problem, or reduced simply to the Subset Sum Problem (where the target sum is half the total of all the numbers).
It seems the generic formula for updating velocities for particles is

v_i <- w*v_i + c1*r1*(pbest_i - x_i) + c2*r2*(gbest - x_i)

(the standard update with inertia weight w, acceleration coefficients c1 and c2, and random factors r1 and r2), but I won't go into too much detail for this question.
Since there's no PSO attempt online for the Subset Sum Problem, I looked at the Travelling Salesman Problem instead.
Their approach to updating velocities involved taking sets of visited towns, subtracting one from another, and doing some manipulation on the result.
I saw no relation between that and the formula above.
My Approach
So I scrapped the formula and tried my own approach to the Subset Sum Problem.
I basically used gbest and pbest to determine the probability of removing or adding a particular element to the subset.
i.e. if my problem space is [1,2,3,4,5] (target is 7 or 8), my current particle (subset) is [1,None,3,None,None], and the gbest is [None,2,3,None,None], then there is a higher probability of keeping 3, adding 2 and removing 1, based on gbest.
I can post code, but I don't think it's necessary; you get the idea (I'm using Python, btw, hence the None).
So basically this worked to an extent: I got decent solutions out, but it was very slow on larger data sets and values.
My Question
Am I encoding the problem and updating the particle "velocities" in a smart way?
Is there a way to determine if this will converge correctly?
Is there a resource I can use to learn how to create convergent "update" formulas for specific problem spaces?
Thanks a lot in advance!
Encoding
Yes, you're encoding this correctly: each of your bit-maps (that's effectively what your 5-element lists are) is a particle.
Concept
Your conceptual problem with the equation is that your problem space is a discrete lattice graph, which doesn't lend itself immediately to the update step. For instance, if you want finer granularity by adjusting your learning rate, you'd generally reduce it by some small factor (say, 3). In this space, what does it mean to take steps only 1/3 as large? That's why you have problems.
The main possibility I see is to create 3x as many particles, but then have the transition probabilities all divided by 3. This still isn't very satisfying, but it does simulate the process somewhat decently.
Discrete Steps
If you have a very large graph, where a high velocity could give you dozens of transitions in one step, you can utilize a smoother distance (loss or error) function to guide your model. With something this small, where you have no more than 5 steps between any two positions, it's hard to work with such a concept.
Instead, you utilize an error function based on the estimated distance to the solution. The easy one is to subtract the particle's total from the nearer of 7 or 8. A harder one is to estimate distance based on that difference and the particle elements "in play".
Proof of Convergence
Yes, there is a way to do it, but it requires some functional analysis. In general, you want to demonstrate that the error function is convex over the particle space. In other words, you'd have to prove that your error function is a reliable distance metric, at least as far as relative placement goes (i.e. prove that a lower error does imply you're closer to a solution).
Creating update formulae
No, this is a heuristic field, based on the shape of the problem space as defined by the particle coordinates, the error function, and the movement characteristics.
Extra recommendation
Your current allowable transitions are "add element" and "delete element".
Add "swap elements" to these: trade one present member for an absent one. This will allow the trivial error function to define a convex space for you, and you'll converge in very little time.
I need to find the diameter of the points cloud (two points with maximum distance between them) in 3-dimensional space. As a temporary solution, right now I'm just iterating through all possible pairs and comparing the distance between them, which is a very slow, O(n^2) solution.
I believe it can be done in O(n log n). It's a fairly easy task in 2D (just find the convex hull and then apply the rotating calipers algorithm), but in 3D I can't imagine how to use rotating calipers, since there is no way to order the points.
Is there any simple way to do it (or ready-to-use implementation in python or C/C++)?
PS: There are similar questions on StackOverflow, but the answers I found only refer to rotating-calipers (or similar) algorithms, which work fine in 2D but are far from clear how to implement in 3D (or higher dimensions).
While O(n log n) expected-time algorithms exist in 3D, they seem tricky to implement (while staying competitive with brute-force O(n^2) algorithms).
An algorithm is described in Har-Peled 2001, and the author provides source code that can optionally be used for exact computation. I was not able to download the latest version; the "old" version could be enough for your purpose, or you might want to contact the author for the code.
An alternative approach is presented in Malandain & Boissonnat 2002, and the authors provide code. Although this algorithm is presented as approximate in higher dimensions, it could fit your purpose. Note that their code also provides an implementation of Har-Peled's method for exact computation that you might want to check.
In any case, in a real-world usage you should always check that your algorithm remains competitive with respect to the naïve O(n^2) approach.
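A practical middle ground in Python: the farthest pair is always realized by two vertices of the convex hull, so you can restrict the O(n^2) pairwise check to the (usually much smaller) hull. A sketch using SciPy:

```python
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist

points = np.random.rand(10000, 3)       # example 3-D point cloud
hull_pts = points[ConvexHull(points).vertices]
diameter = pdist(hull_pts).max()         # brute force, but on hull vertices only
print(diameter)
```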
Picture is worth a thousand words, so:
My input is the matrix on the left, and what I need to find is the sets of nodes that are at most one step away from each other (not diagonally). A node that is more than one up/down/left/right step away would be in a separate set.
So my plan was to run a BFS from every node I find, return the set it traversed, remove that set from the original one, and iterate until done. But then I had the wild idea of looking for graph-analysis tools, and I found NetworkX. Is there an easy way (an algorithm?) to achieve this without manually writing a BFS and traversing the whole matrix?
Thanks
What you are trying to do is search for "connected components", and NetworkX has a method that does exactly that, as can be seen in the first example on this documentation page and as others have already pointed out in the comments.
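A minimal sketch (the example graph is made up; in your case you would build it from your matrix):

```python
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (3, 4)])       # assumed example graph
components = list(nx.connected_components(G))
print(components)                             # [{0, 1, 2}, {3, 4}]
largest = max(components, key=len)            # biggest set of mutually reachable nodes
```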
Reading your question, it seems that your nodes are on a discrete grid, and the notion of connectedness you describe is the same one used for the pixels of an image.
Connected components algorithms are available for graphs and for images also.
If performance is important in your case, I would suggest going for the image version of connected components.
This comes from the fact that images (grids of pixels) are a specific class of graphs, so connected-components algorithms that deal with grids of nodes are built knowing the topology of the graph itself (i.e. the graph is planar and the maximum vertex degree is four). A general algorithm has to be able to work on arbitrary graphs (i.e. they may be non-planar, with multiple edges between some nodes), so it has to spend more work because it can't assume much about the properties of the input graph.
Since connected components can be found on graphs in linear time, I am not saying the image version will be orders of magnitude faster; there will only be a constant factor between the two.
For this reason, you should also take into account which data structure holds your input data and how much time will be spent creating the input structures required by each version of the algorithm.
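If your input is already a 0/1 matrix, the image version is essentially one call. A sketch using SciPy, whose default structuring element gives exactly the 4-connectivity (no diagonals) described in the question:

```python
import numpy as np
from scipy import ndimage

matrix = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 1],
                   [0, 0, 0, 1],
                   [1, 0, 0, 0]])

# Default structure is the 4-connected cross: up/down/left/right, no diagonals.
labels, n_sets = ndimage.label(matrix)
print(n_sets)   # 3 connected sets in this example
print(labels)   # each nonzero cell tagged with the id of its set
```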