Combinatorially generating subgraphs between a given upper and lower bound (Sage) - python

I am new to Sage and I have been reading the documentation, but this is very new territory for me and it is a bit tough to get a good grasp on everything.
What I want to do is, given an adjacency matrix, an upper bound, and a lower bound, generate all pathways through that matrix, where a "pathway" consists of one entry from each row, such that the weight of the pathway is equal to or between the bounds.
Even better would be if I could organize the resulting pathways by 1.) the number of edges in each row with a lower weight than the entry in the pathway, and/or 2.) minimum overlap with other pathways with regard to #1.
For clarity, a quick example.
Given the 4x4 matrix:
[[1,2,3,4],
[5,6,7,8],
[10,11,12,13],
[20,21,22,23]]
And an upper bound 38, and a lower bound 37, possible pathways could be:
2,5,10,20
3,5,10,20
2,6,10,20
2,5,11,20
2,5,10,21
etc etc. I don't want to write out all the pathways, so hopefully you get the idea.
Even better would be if I could quickly filter out redundancy by not including pathways that are subsets of other pathways (for example, 2,5,10,20 is encompassed by 3,5,10,20, since for each pathway I plan on including all lower-weight edges of each respective row).

If you have a symmetric matrix (with or without diagonal entries nonzero) you can do this.
M = matrix([[0,2,3,4],[2,0,7,8], [3,7,0,13], [4,8,13,0]])
G = Graph(M,format='weighted_adjacency_matrix')
G.graphplot(edge_labels=True,spring=True).show()
And hopefully from G itself you could do what you want. (Unless it has nothing to do with graphs and only the matrix, in which case you may want a different thing entirely.)
I'm not sure exactly what you are trying to do (how do the "pathways" correspond to subgraphs?), and probably describing that would be beyond the scope of this site (as opposed to math.SX.com), but the generic graph documentation has some path methods and the undirected documentation may also prove helpful.
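If the graph machinery turns out to be unnecessary, a brute-force combinatorial sketch (plain Python, so it also runs inside Sage) may be enough for small matrices: enumerate one entry per row with itertools.product and keep the combinations whose sum lies between the bounds. The matrix and bounds below are just the example from the question; note that the number of candidate combinations grows as ncols**nrows, so this will not scale to large inputs.

from itertools import product

M = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [10, 11, 12, 13],
     [20, 21, 22, 23]]
lower, upper = 37, 38

# one entry per row; keep pathways whose total weight is within the bounds
pathways = [p for p in product(*M) if lower <= sum(p) <= upper]
for p in sorted(pathways):
    print(p)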

Related

Python data structure to implement photon simulation

The Problem
I have a function of 2 independent variables, from which I need to construct a 2-D grid with parameters of my choice. These parameters will be the ranges of the two independent variables, the smallest/largest cell size, etc.
Visually this can be thought of as a space-partition data structure whose geometry is provided by my function. I will then assign some properties to each cell (which will also be determined by the function and other things).
My Idea
Once such a data structure is prepared, I can simulate a photon (in a Monte-Carlo based approach) which will travel to a cell randomly (with some constraints, which are given by the function and the cell properties) and will be absorbed or scattered (re-emitted) in that cell with some probabilities (at which point I will solve the radiative transfer equation in that cell). After this, the photon, if re-emitted, now with a different wavelength, moves to a neighbouring cell, and keeps moving until it escapes the grid (whose boundaries were decided by the parameters) or is completely absorbed in one of the cells. So my data structure should also allow efficient access to a cell's nearest neighbouring cells from a computational point of view.
What I have done
I looked into scipy.spatial.kdtree but I am not sure I will be able to assign properties to cells the way I want (though if someone can explain that, it would be really helpful, as it is very good at accessing nearest neighbours). I have also looked at other tree-based algorithms but I am a bit lost at how to implement them. Numpy arrays are, at the end of the day, matrices, which does not suit my function (it leads to a waste of memory).
So any suggestions on data structures I can use (and some nudge towards how I could implement them) would be really nice.
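In the absence of an answer here, a minimal sketch of one possible structure: a dictionary of cells keyed by integer indices, each holding its own property dict, with constant-time neighbour lookup. The class name, the uniform rectangular partition and the cell_properties callback are all assumptions for illustration; the cell edges could just as well be non-uniform, or the cells created lazily to avoid wasting memory.

import numpy as np

class CellGrid:
    """A 2-D partition with per-cell properties and cheap neighbour access."""
    def __init__(self, x_range, y_range, nx, ny, cell_properties):
        self.nx, self.ny = nx, ny
        self.x_edges = np.linspace(x_range[0], x_range[1], nx + 1)
        self.y_edges = np.linspace(y_range[0], y_range[1], ny + 1)
        # one property dict per cell, keyed by integer indices (i, j)
        self.cells = {(i, j): cell_properties(i, j)
                      for i in range(nx) for j in range(ny)}

    def locate(self, x, y):
        """Return the (i, j) cell containing the point, or None if outside the grid."""
        i = int(np.searchsorted(self.x_edges, x, side="right")) - 1
        j = int(np.searchsorted(self.y_edges, y, side="right")) - 1
        return (i, j) if (i, j) in self.cells else None

    def neighbours(self, i, j):
        """The 4-connected neighbouring cells (constant-time dictionary lookups)."""
        candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
        return [c for c in candidates if c in self.cells]

# hypothetical usage: a 10x10 grid over the unit square, uniform density everywhere
grid = CellGrid((0.0, 1.0), (0.0, 1.0), 10, 10, lambda i, j: {"density": 1.0})
print(grid.locate(0.25, 0.73), grid.neighbours(2, 7))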

How to find the largest linked grouping of sub-graphs?

I want to determine the largest contiguous (if that’s the right word) graph consisting of a bunch of sub-graphs. I define two sub-graphs as being contiguous if any node of one sub-graph is linked to a node of the other.
My initial solution to this is very slow and cumbersome and stupid – just to look at each sub-graph, see if it’s linked to any of the other sub-graphs, and do the analysis for all of the sub-graphs to find the largest number of linked sub-graphs. That’s just me coming from a Fortran background. Is there a better way to do it – a pythonic way, even a graph theory way? I imagine this is a standard question in network science.
A good starting point to answer the kind of question you've asked is to look at a merge-find (or disjoint-set) approach (https://en.wikipedia.org/wiki/Disjoint-set_data_structure).
It offers an efficient algorithm (at least on an amortized basis) to identify which members of a collection of graphs are disjoint and which aren't.
Here are a couple of related questions that have pointers to additional resources about this algorithm (also known as "union-find"):
Union find implementation using Python
A set union find algorithm
You can get quite respectable performance by merging two sets using "union by rank" as summarized in the Wikipedia page (and the pseudocode provided therein):
For union by rank, a node stores its rank, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots x and y, first compare their ranks. If the ranks are different, then the larger rank tree becomes the parent, and the ranks of x and y do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a Find operation, so storing ranks avoids the extra effort of keeping the height correct.
I believe there may be even more sophisticated approaches, but the above union-by-rank implementation is what I have used in the past.
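For concreteness, here is a minimal sketch of union-find with union by rank (plus path halving in find), followed by picking the largest resulting group. The node labels and edge list below are just placeholders; in practice the edges would be whatever links your sub-graphs together.

from collections import Counter

class DisjointSet:
    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.rank = {x: 0 for x in items}

    def find(self, x):
        # path halving: point each visited node at its grandparent
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # union by rank: attach the tree of smaller rank under the other root
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

# hypothetical edges linking the nodes of several sub-graphs
edges = [(1, 2), (2, 3), (4, 5), (6, 7), (7, 1)]
nodes = {n for e in edges for n in e}
ds = DisjointSet(nodes)
for a, b in edges:
    ds.union(a, b)

sizes = Counter(ds.find(n) for n in nodes)
root, size = sizes.most_common(1)[0]
print(size, [n for n in nodes if ds.find(n) == root])  # the largest linked group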

Match the vertices of two identical labelled graphs

I have a rather simple problem to define, but I have not found a simple answer so far.
I have two graphs (i.e. sets of vertices and edges) which are identical. Each of them has independently labelled vertices. Look at the example below:
How can the computer detect, without prior knowledge of it, that 1 is identical to 9, 2 to 10 and so on?
Note that in the case of symmetry, there may be several possible one-to-one pairings which give complete equivalence, but finding just one of them is sufficient for me.
This is in the context of a Python implementation. Does someone have a pointer towards a simple algorithm publicly available on the Internet? The problem sounds simple, but I simply lack the mathematical knowledge to come up with it myself or to find the proper keywords to locate the information.
EDIT: Note that I also have atom types (i.e. labels) for each graph, as well as the full distance matrix for the two graphs to align. However the positions may be similar but not exactly equal.
This is known as the graph isomorphism problem, and it is probably very hard, although the exact details of how hard are still a subject of research.
(But things look better if your graphs are planar.)
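As a concrete starting point, here is a minimal sketch using networkx's VF2-based GraphMatcher (the standard library route, not the home-grown approach in the next answer), assuming each vertex carries an "atom" attribute that must match; the tiny graphs below are made up for illustration.

import networkx as nx
from networkx.algorithms import isomorphism

# two identical graphs with independent vertex labels and an "atom" attribute
G1 = nx.Graph()
G1.add_nodes_from([(1, {"atom": "C"}), (2, {"atom": "O"}), (3, {"atom": "C"})])
G1.add_edges_from([(1, 2), (2, 3)])

G2 = nx.Graph()
G2.add_nodes_from([(9, {"atom": "C"}), (10, {"atom": "O"}), (11, {"atom": "C"})])
G2.add_edges_from([(9, 10), (10, 11)])

# vertices may only be paired if their "atom" attributes agree
gm = isomorphism.GraphMatcher(
    G1, G2, node_match=isomorphism.categorical_node_match("atom", None))
if gm.is_isomorphic():
    print(gm.mapping)  # one valid pairing, e.g. {1: 9, 2: 10, 3: 11}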
So, after searching for it a bit, I think I found a solution that works most of the time for a moderate computational cost. It is a kind of genetic algorithm which uses a bit of randomness, but it seems practical enough for my purposes. I haven't had any aberrant configuration with my samples so far, even if it is theoretically possible for this to happen.
Here is how I proceeded:
1. Determine the complete set of 2-paths, 3-paths and 4-paths.
2. Determine vertex types using both the atom type and the surrounding topology, creating an "identity card" for each vertex.
3. Do the following ten times:
4. Start with a random candidate set of pairings complying with the allowed vertex types.
5. Evaluate how many of the 2-paths, 3-paths and 4-paths correspond between the two pairings, scoring one point for each corresponding vertex (also using the atom type as an additional descriptor); an illustrative sketch of this scoring idea follows the list.
6. Evaluate all other shortlisted candidates for a given vertex by permuting the pairings for this candidate with its other positions in the same way.
7. Sort the scores in descending order.
8. For each score, check whether the configuration is among the excluded configurations; if it is not, take it as the new configuration and add it to the excluded configurations.
9. If the score is perfect (i.e. all of the 2-paths, 3-paths and 4-paths correspond), stop the loop and calculate the sum of absolute differences between the distance matrices of the two graphs using the selected pairing; otherwise go back to step 4.
10. Stop this process after it has been done 10 times.
11. Check the differences between the distance matrices and take the pairing associated with the minimal sum of absolute differences.
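For illustration, one possible reading of the scoring step (step 5): count how many 2-paths of the first graph are mapped onto 2-paths of the second under a candidate pairing. The adjacency-set representation and all names below are assumptions rather than the original code, and 3-paths and 4-paths would be scored analogously.

from itertools import combinations

def two_paths(adj):
    """All 2-paths (u-mid-w) of a graph given as {vertex: set(neighbours)}."""
    return {(mid, frozenset((u, w)))
            for mid, nbrs in adj.items()
            for u, w in combinations(sorted(nbrs), 2)}

def score(adj1, adj2, pairing):
    """How many 2-paths of graph 1 map onto 2-paths of graph 2 under `pairing`."""
    image = {(pairing[mid], frozenset(pairing[x] for x in ends))
             for mid, ends in two_paths(adj1)}
    return len(image & two_paths(adj2))

# tiny example: the path 1-2-3 paired with the path 9-10-11
adj1 = {1: {2}, 2: {1, 3}, 3: {2}}
adj2 = {9: {10}, 10: {9, 11}, 11: {10}}
print(score(adj1, adj2, {1: 9, 2: 10, 3: 11}))  # 1: the single 2-path matches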

Python - multi-dimensional clustering with thresholds

Imagine I have a dataset as follows:
[{"x":20, "y":50, "attributeA":90, "attributeB":3849},
{"x":34, "y":20, "attributeA":86, "attributeB":5000},
etc.
There could be a bunch more attributes in addition to these; this is just an example. What I am wondering is: how can I cluster these points based on all of the factors, with control over the maximum separation allowed between a given point and the next for each variable before they are considered linked (i.e. euclidean distance must be within 10 points, attributeA within 5 points and attributeB within 1000 points)?
Any ideas on how to do this in Python? As I implied above, I would like to use euclidean distance to compare the distance between two points if possible, not just compare x and y as separate attributes. For the rest of the attributes it would all be single-dimensional comparison... if that makes sense.
Edit: Just to add some clarity in case this doesn't make sense: basically I am looking for some algorithm to compare all objects with each other (or some more efficient way). If all of object A's attributes and the euclidean distance are within the specified thresholds when compared to object B, then those two are considered similar and linked. This procedure continues until eventually all the linked clusters can be returned; some clusters will have no points that satisfy the conditions to be similar to any point in another cluster, resulting in the clusters being separated.
The simplest approach is to build a binary "connectivity" matrix.
Let a[i,j] be 0 exactly if your conditions are fulfilled, 1 otherwise.
Then run hierarchical agglomerative clustering with complete linkage on this matrix. If you don't need every pair of objects in every cluster to satisfy your threshold, then you can also use other linkages.
This isn't the best solution: the distance matrix will need O(n²) memory and time, and the clustering even O(n³), but it is the easiest to implement. Computing the distance matrix in Python code will be really slow unless you can avoid all loops and have e.g. numpy do most of the work. To improve scalability, you should consider DBSCAN and a data index.
It's fairly straightforward to replace the three different thresholds with weights, so that you can get a continuous distance; likely even a metric. Then you could use data indexes, and try out OPTICS.
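A small sketch of that recipe with scipy, assuming numpy/scipy are available; the points and thresholds are the ones from the question, and cutting the binary matrix at 0.5 with complete linkage means every pair inside a cluster satisfies all thresholds.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

points = [{"x": 20, "y": 50, "attributeA": 90, "attributeB": 3849},
          {"x": 34, "y": 20, "attributeA": 86, "attributeB": 5000},
          {"x": 22, "y": 55, "attributeA": 93, "attributeB": 3700}]

def connected(p, q):
    """0 if all thresholds are satisfied ("linked"), 1 otherwise."""
    spatial = np.hypot(p["x"] - q["x"], p["y"] - q["y"])
    return 0.0 if (spatial <= 10
                   and abs(p["attributeA"] - q["attributeA"]) <= 5
                   and abs(p["attributeB"] - q["attributeB"]) <= 1000) else 1.0

n = len(points)
d = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d[i, j] = d[j, i] = connected(points[i], points[j])

# complete linkage: a cluster only forms if every pair inside it is linked
Z = linkage(squareform(d), method="complete")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # cluster id per point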

mlpy - Dynamic Time Warping depends on x?

I am trying to get the distance between these two arrays shown below by DTW.
I am using the Python mlpy package that offers
dist, cost, path = mlpy.dtw_std(y1, y2, dist_only=False)
I understand that DTW takes care of the "shifting". In addition, as can be seen from above, mlpy.dtw_std() only takes in two 1-D arrays, so I expect that no matter how I shift my curves left or right, the dist returned by the function should never change.
However after shifting my green curve a bit to the right, the dist returned by mlpy.dtw_std() changes!
Before shifting: Python mlpy.dtw_std reports dist = 14.014
After shifting: Python mlpy.dtw_std reports dist = 38.078
Obviously, since the curves are still those two curves, I don't expect the distances to be different!
Why is this so? Where did I go wrong?
Let me reiterate what I have understood; please correct me if I am going wrong anywhere. I observe that in both your plots, your 1-D series in blue remains identical, while the green one is getting stretched. How you are doing it, you have explained in your comment of Sep 19 '13 at 9:36. Your premise is that because (1) DTW 'takes care' of time shift and (2) all you are doing is stretching one time series length-wise without affecting the y-values, (Inference:) you expect the distance to remain the same.
There is a little missing link between [(1), (2)] and [(Inference)]: the individual distance values corresponding to the mappings WILL change as you change the set of signals itself, and this will result in a different overall distance. Plot the warping paths and the cost grid to see it for yourself.
Let's take an oversimplified case...
Let
a=range(0,101,5) = [0,5,10,15...95, 100]
and b=range(0,101,5) = [0,5,10,15...95, 100].
Now, intuitively speaking, you/I would expect a one-to-one correspondence between the 2 signals (for the DTW mapping), and the distance for all of the mappings to be 0, the signals being identical-looking.
Now if we make, b=range(0,101,4) = [0,4,8,12...96,100],
The DTW mapping between a and b would still start with a's 0 getting mapped to b's 0, and end with a's 100 getting mapped to b's 100 (boundary constraints). Also, because DTW 'takes care' of time shift, I would expect the 20's, 40's, 60's and 80's of the two signals to be mapped to one another. (I haven't tried DTWing these two myself; I am saying it from intuition, so please check. There is a small possibility of non-intuitive warpings taking place as well, depending on the step patterns allowed / global constraints, but let's go with intuitive warpings for the moment for ease of understanding / the sake of simplicity.)
For the remaining data points, clearly, the distances corresponding to the mappings are now non-zero, and therefore the overall distance is non-zero too. Our distance/overall cost value has changed from zero to something non-zero.
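A quick way to check the toy example above, assuming mlpy is installed (dtw_std is the same function used in the question); the expectations in the comments come from the reasoning above, not from an actual run.

import numpy as np
import mlpy

a = np.array(range(0, 101, 5), dtype=float)  # 21 points: 0, 5, ..., 100
b = np.array(range(0, 101, 4), dtype=float)  # 26 points: 0, 4, ..., 100

# identical signals: every mapping has zero local cost, so the distance is 0
print(mlpy.dtw_std(a, a, dist_only=True))

# same endpoints, different sampling: the local costs are no longer all zero,
# so the accumulated DTW distance becomes non-zero
print(mlpy.dtw_std(a, b, dist_only=True))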
Now, this was the case when our signals were overly simplistic, linearly increasing ones. Imagine the variability that will come into the picture when you have real-life non-monotonic signals and need to find the time-warping between them. :)
(PS: Please don't forget to upvote answer :D). Thanks.
Obviously, the curves are not identical, and therefore the distance function must not be 0 (otherwise, it is not a distance by definition).
What IS "relatively large"? The distance probably is not infinite, is it?
140 points in time, each with a small delta, still add up to a non-zero number.
The distance from "New York" to "Beijing" is roughly 11018 km. Or 11018000000 mm.
The distance to Alpha Centauri is small, just 4.34 light-years. That is the nearest other stellar system to us...
Compare with the distance to a non-similar series; that distance should be much larger.
