What's the right algorithm for finding isolated subsets - python

Picture is worth a thousand words, so:
My input is the matrix on the left, and what I need to find are the sets of nodes that are at most one step away from each other (not diagonally). A node that is more than one up/down/left/right step away would be in a separate set.
So, my plan was to run a BFS from every node I find, return the set it traversed through, and remove it from the original set, iterating until I'm done. But then I had the idea of looking for graph analysis tools, and I found NetworkX. Is there an easy way (an existing algorithm?) to achieve this without manually writing a BFS and traversing the whole matrix?
Thanks

What you are trying to do is find "connected components", and NetworkX has a method for doing exactly that, as can be seen in the first example on this documentation page and as others have already pointed out in the comments.
Reading your question, it seems that your nodes lie on a discrete grid and that the notion of connectivity you describe is the same one used for the pixels of an image.
Connected-components algorithms are available both for graphs and for images.
If performance is important in your case, I would suggest going with the image version of connected components.
This is because images (grids of pixels) are a specific class of graphs, so connected-components algorithms that deal with grids of nodes are built with knowledge of the topology of the graph itself (i.e. the graph is planar and the maximum vertex degree is four). A general graph algorithm has to be able to work on arbitrary graphs (i.e. possibly non-planar, possibly with multiple edges between some nodes), so it has to do more work because it can't assume much about the properties of its input.
Since connected components can be found in linear time on graphs, I am not saying the image version would be orders of magnitude faster; there will only be a constant factor between the two.
For this reason you should also take into account which data structure holds your input data and how much time will be spent creating the input structures required by each version of the algorithm.
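For illustration, here is a minimal sketch of both routes on a small binary matrix (the matrix itself is a made-up stand-in for your input): the graph version builds a NetworkX graph over the non-zero cells and calls connected_components, while the image version uses scipy.ndimage.label, whose default structuring element already means 4-connectivity (no diagonals):

import numpy as np
import networkx as nx
from scipy import ndimage

# A small binary matrix standing in for the real input (1 = node present).
matrix = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 1],
                   [0, 0, 0, 1],
                   [1, 0, 0, 1]])

# Graph version: one NetworkX node per non-zero cell, edges between
# up/down/left/right neighbours, then connected_components().
G = nx.Graph()
rows, cols = matrix.shape
for r in range(rows):
    for c in range(cols):
        if matrix[r, c]:
            G.add_node((r, c))
            if r + 1 < rows and matrix[r + 1, c]:
                G.add_edge((r, c), (r + 1, c))
            if c + 1 < cols and matrix[r, c + 1]:
                G.add_edge((r, c), (r, c + 1))
print(list(nx.connected_components(G)))

# Image version: scipy's connected-component labelling; the default
# structuring element is exactly 4-connectivity (no diagonals).
labels, n_sets = ndimage.label(matrix)
print(n_sets)
print(labels)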

Related

Match the vertices of two identical labelled graphs

I have a problem that is rather simple to define, but I have not found a simple answer so far.
I have two graphs (i.e. sets of vertices and edges) which are identical. Each of them has independently labelled vertices. Look at the example below:
How can the computer detect, without prior knowledge of it, that 1 is identical to 9, 2 to 10 and so on?
Note that in the case of symmetry there may be several possible one-to-one pairings which give complete equivalence, but just finding one of them is sufficient for me.
This is in the context of a Python implementation. Does someone have a pointer towards a simple algorithm publicly available on the Internet? The problem sounds simple, but I simply lack the mathematical knowledge to come up with it myself or to find the proper keywords to look up the information.
EDIT: Note that I also have atom types (i.e. labels) for each graph, as well as the full distance matrix for the two graphs to align. However, the positions may be similar but not exactly equal.
This is known as the graph isomorphism problem, and it is probably very hard, although the exact details of how hard it is are still a subject of research.
(But things look better if your graphs are planar.)
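That said, for graphs of moderate size that really are identical up to relabelling, NetworkX's VF2 matcher is often enough in practice. A minimal sketch (the toy graphs below are made up; node_match could additionally compare atom types if they were stored as node attributes):

import networkx as nx
from networkx.algorithms import isomorphism

# Two identical graphs whose vertices are labelled independently.
G1 = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1)])
G2 = nx.Graph([(9, 10), (10, 11), (11, 12), (12, 9)])

# VF2 isomorphism matcher from NetworkX.
gm = isomorphism.GraphMatcher(G1, G2)
if gm.is_isomorphic():
    # One valid pairing among possibly several (symmetries), e.g.
    # {1: 9, 2: 10, 3: 11, 4: 12}.
    print(gm.mapping)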
So, after searching a bit, I think I have found a solution that works most of the time at moderate computational cost. It is a kind of genetic algorithm that uses a bit of randomness, but it seems practical enough for my purposes. I haven't had any aberrant configuration with my samples so far, even though it is theoretically possible for that to happen.
Here is how I proceeded:
1. Determine the complete set of 2-paths, 3-paths and 4-paths.
2. Determine the vertex types using both the atom type and the surrounding topology, creating an "identity card" for each vertex.
3. Do the following ten times:
4. Start with a random candidate set of pairings complying with the allowed vertex types.
5. Evaluate how many of the 2-paths, 3-paths and 4-paths correspond between the two graphs under the candidate pairing, scoring one point for each corresponding vertex (also using the atom type as an additional descriptor).
6. Evaluate all other shortlisted candidates for a given vertex by permuting the pairings for this candidate with its other positions in the same way.
7. Sort the scores in descending order.
8. For each score, check whether the configuration is among the excluded configurations; if it is not, take it as the new configuration and add it to the excluded configurations.
9. If the score is perfect (i.e. all of the 2-paths, 3-paths and 4-paths correspond), stop the loop and calculate the sum of absolute differences between the distance matrices of the two graphs using the selected pairing; otherwise go back to 4.
10. Stop this process after it has been done 10 times.
11. Check the differences between the distance matrices and take the pairing associated with the minimal sum of absolute differences between the distance matrices (a sketch of this last step is given below).
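As an illustration of that final selection step, a minimal sketch assuming the two precomputed distance matrices and the candidate pairings are already available as NumPy arrays (the names D1, D2 and candidate_pairings are placeholders):

import numpy as np

def pairing_cost(dist1, dist2, pairing):
    """Sum of absolute differences between the two distance matrices when
    graph 2 is reordered according to `pairing`, where pairing[i] is the
    index of the vertex in graph 2 matched to vertex i of graph 1."""
    perm = np.asarray(pairing)
    return np.abs(dist1 - dist2[np.ix_(perm, perm)]).sum()

# Hypothetical usage: keep the candidate pairing with the smallest cost.
# best = min(candidate_pairings, key=lambda p: pairing_cost(D1, D2, p))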

Algorithms for constrained clustering on attributed graphs with some cluster-level constraints on their attributes

I have a graph with 240k nodes and 550k edges with five attributes per node coming out of an autoencoder from a sparse dataset. I'm looking to partition the graph into n clusters, such that intra-partition attribute similarity is maximized, the partitions are connected, and the sum of one of the attributes doesn't exceed a threshold for any given cluster.
I've tried poking around with an autoencoder but had issues making a loss function that would get the results I needed. I've also looked at hierarchical clustering with connectivity constraints but can't find a way to enforce my sum constraint optimally. Same issue with community detection algorithms on graphs like Louvain.
If anyone knows of any approaches to solving this I'd love to hear it, ideally something implemented in Python already but I can probably implement whatever algorithm I need should it not be. Thanks!
First of all, the problem is most likely NP-hard, so the best you can do is some greedy optimization. It will definitely help to first break the graph into subsets that cannot ever be connected (remove links between nodes that are not similar enough, then compute the connected components). Then, for each component (which are hopefully much smaller than 250k, otherwise tough luck!), run a classic optimizer that allows you to specify the cost function. It is probably a good idea to use an integer linear program, and to consider the Lagrangian dual of the problem.
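A rough sketch of that first preprocessing step, assuming the node attributes are stored in a dict of vectors; the Euclidean distance and the threshold are placeholder choices:

import networkx as nx
import numpy as np

def split_into_candidate_components(G, attrs, threshold):
    """Drop edges whose endpoints are too dissimilar, then return the
    connected components of what is left.  `attrs` maps node -> attribute
    vector; `threshold` is a placeholder dissimilarity cutoff."""
    H = G.copy()
    for u, v in list(H.edges()):
        # Euclidean distance as a stand-in for "not similar enough".
        if np.linalg.norm(np.asarray(attrs[u]) - np.asarray(attrs[v])) > threshold:
            H.remove_edge(u, v)
    return [H.subgraph(c).copy() for c in nx.connected_components(H)]

# Each (hopefully small) component can then be handed to a classic
# optimizer / ILP that also enforces the per-cluster sum constraint.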

Community Detection Algorithms using NetworkX

I have a graph network, the Email-Eu network, which is available here.
The dataset is a graph of around 1005 nodes, together with the edges that form this giant graph. It also has the ground-truth labels for the nodes and their corresponding communities (departments). Each of these nodes belongs to one of 42 departments.
I want to run a community detection algorithm on the graph to find the corresponding department for each node. My main objective is to find the nodes in the largest community.
So, first I need to find the first 42 departments (Communities), then find the nodes in the biggest one of them.
I started with the Girvan-Newman algorithm to find the communities. The beauty of Girvan-Newman is that it is easy to implement: each time I just find the edge with the highest betweenness and remove it, until I end up with the 42 departments (communities) I want.
I am struggling to find other Community Detection Algorithms that give me the option of specifying how many communities/partitions I need to break down my graph into.
Is there any community detection function/technique that I can use which gives me the option of specifying how many communities I need to uncover from my graph? Any ideas are very much appreciated.
I am using Python and NetworkX.
A (very) partial answer (and solution) to your question is to use the Fluid Communities algorithm, implemented by NetworkX as asyn_fluidc.
Note that it works on connected, undirected, unweighted graphs, so if your graph has n connected components, you should run it n times. In fact this could be a significant issue as you should have some sort of preliminary knowledge of each component to choose the corresponding k.
Anyway, it is worth a try.
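A minimal sketch of that approach; the karate-club graph stands in for the Email-Eu data, and the per-component k is a placeholder you would have to choose from prior knowledge of each component:

import networkx as nx
from networkx.algorithms.community import asyn_fluidc

G = nx.karate_club_graph()  # stand-in for the Email-Eu graph

communities = []
for component in nx.connected_components(G):
    sub = G.subgraph(component).copy()
    k = 2  # placeholder: number of communities expected in this component
    communities.extend(asyn_fluidc(sub, k))

largest = max(communities, key=len)
print(len(largest), "nodes in the largest community")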
You may want to try pysbm. It is based on networkx and implements different variants of stochastic block models and inference methods.
If you consider switching from networkx to a different Python-based graph package, you may want to look at graph-tool, where you would be able to use the stochastic block model for the clustering task. Another noteworthy package is igraph; you may want to look at How to cluster a graph using python igraph.
The approaches directly available in networkx are rather old-fashioned. If you aim for state-of-the-art clustering methods, you may consider spectral clustering or Infomap. The selection depends on your desired usage of the inferred communities. The task of inferring the ground truth from a network falls under an (approximate) No-Free-Lunch theorem, i.e. (roughly) no algorithm exists that returns "better" communities than any other algorithm if we average the results over all possibilities.
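Since spectral clustering lets you fix the number of clusters directly, here is a hedged sketch using scikit-learn's SpectralClustering on the adjacency matrix of a NetworkX graph (the karate-club graph and n_clusters=4 are stand-ins; for Email-Eu you would use the real graph and n_clusters=42):

import networkx as nx
from sklearn.cluster import SpectralClustering

G = nx.karate_club_graph()          # stand-in for the Email-Eu graph
nodes = list(G)
adjacency = nx.to_numpy_array(G, nodelist=nodes)

# affinity='precomputed' feeds the adjacency matrix in directly, and
# n_clusters fixes the number of communities (42 for the Email-Eu data).
sc = SpectralClustering(n_clusters=4, affinity="precomputed", random_state=0)
labels = sc.fit_predict(adjacency)

communities = {}
for node, label in zip(nodes, labels):
    communities.setdefault(label, set()).add(node)
print(max(communities.values(), key=len))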
I am not entirely sure of my answer, but maybe you can try this. Are you aware of label propagation? The main idea is that you have some nodes in the graph which are labelled, i.e. they belong to a community, and you want to give labels to the other, unlabelled nodes in your graph. LPA will spread these labels across the graph and give you a list of nodes and the communities they belong to. These communities will be the same as the ones that your labelled set of nodes belongs to.
So I think you can control the number of communities you want to extract from the graph by controlling the number of communities you initialise in the beginning. But I think it is also possible that, after LPA converges, some of the communities you initialised vanish from the graph due to the graph structure and also the randomness of the algorithm. There are many variants of LPA where you can control this randomness. I believe this page of sklearn talks about it.
You can read about LPA here and also here.
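Since NetworkX's built-in label propagation routines don't take seed labels, here is a hand-rolled sketch of a seeded LPA (not a library call; the function name and the seed dictionary are made up for illustration):

import random
from collections import Counter

def seeded_label_propagation(G, seeds, max_iter=100, rng_seed=0):
    """Spread the labels in `seeds` (node -> community label) to the rest
    of a NetworkX graph by repeatedly taking the majority label among each
    node's neighbours.  Seed nodes keep their labels; ties break at random."""
    rng = random.Random(rng_seed)
    labels = dict(seeds)
    for _ in range(max_iter):
        changed = False
        nodes = list(G)
        rng.shuffle(nodes)
        for n in nodes:
            if n in seeds:
                continue
            neighbour_labels = [labels[m] for m in G[n] if m in labels]
            if not neighbour_labels:
                continue
            counts = Counter(neighbour_labels)
            top = max(counts.values())
            choice = rng.choice([lab for lab, c in counts.items() if c == top])
            if labels.get(n) != choice:
                labels[n] = choice
                changed = True
        if not changed:
            break
    return labels

# Hypothetical usage: one seed node per department caps the number of
# distinct communities that can appear in the result.
# labels = seeded_label_propagation(G, {0: "dept_A", 33: "dept_B"})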

Pattern Matching or comparing two graphs (line charts)

Given:
Ideal graph - Depicts the expected reading my machine should have.
Actual graph - Depicts the actual reading my machine had at that instance.
X-axis: Force(N) from the machine
Y-axis: Time(s)
Both graphs were created using pyplot library in python.
What I need to do:
I need to compare the graph in its three phases: initialization (the machine starts applying force), constant phase (constant force), and end phase (the machine stops applying force), and give an analysis of how close the phases in the actual reading were to the ideal case (in terms of percentage). The analysis would allow me to conclude how the machine performed in those three phases for the actual reading taken. I would need to do this for each reading, taken every 50s.
Hurdle:
Now, the two graphs were not created using the same number of data points: the ideal graph was created with a set of 100 points and the actual graph was created using 30,000+ points. So I would not be able to compare the graphs point by point.
Idea:
Would it be wise to save the graph of the actual read as a png and compare it with the image of the ideal case graph?
Please give me some idea or solution to tackle this problem.
It's a bit late but I'm going to answer anyway:
I don't think resorting to a comparison of images is wise in this case, no.
What you probably want is to interpolate additional points between the 100 points on the 'Ideal graph' to match the 30,000+ points in the 'Actual graph'.
The example on 1-D Interpolation in the scipy.interpolate docs seems to be exactly what you need.
If you need further assistance (such as working code), you will have to provide a Minimal, Complete, and Verifiable Example for us to work with.
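As a rough sketch of that interpolation approach (the arrays below are made-up placeholders; it assumes time is the independent axis and that the ideal curve gets resampled onto the denser time grid of the actual reading):

import numpy as np
from scipy.interpolate import interp1d

# Made-up placeholders: 100 points for the ideal curve, 30,000 for the actual one.
t_ideal = np.linspace(0, 50, 100)
f_ideal = np.sin(t_ideal / 10)                     # placeholder ideal force values
t_actual = np.linspace(0, 50, 30000)
f_actual = np.sin(t_actual / 10) + 0.05 * np.random.randn(t_actual.size)

# Resample the ideal curve onto the denser time grid, then compare point-wise.
f_ideal_resampled = interp1d(t_ideal, f_ideal, kind="linear")(t_actual)
percent_error = 100 * np.abs(f_actual - f_ideal_resampled) / np.abs(f_ideal_resampled).max()
print("mean deviation: %.2f%%" % percent_error.mean())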

Using Python to generate a connection/network graph

I have a text file with about 8.5 million data points in the form:
Company 87178481
Company 893489
Company 2345788
[...]
I want to use Python to create a connection graph to see what the network between companies looks like. From the above sample, two companies would share an edge if the value in the second column is the same (clarification from/for Hooked).
I've been using the NetworkX package and have been able to generate a network for a few thousand points, but it's not making it through the full 8.5 million-node text file. I ran it and left for about 15 hours, and when I came back, the cursor in the shell was still blinking, but there was no output graph.
Is it safe to assume that it was still running? Is there a better/faster/easier approach to graph millions of points?
If you have 1000K points of data, you'll need some way of looking at the broad picture. Depending on what you are looking for exactly, if you can assign a "distance" between companies (say number of connections apart) you can visualize relationships (or clustering) via a Dendrogram.
Scipy does clustering:
http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html#module-scipy.cluster.hierarchy
and has a function to turn them into dendrograms for visualization:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html#scipy.cluster.hierarchy.dendrogram
An example for a shortest path distance function via networkx:
http://networkx.lanl.gov/reference/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html#networkx.algorithms.shortest_paths.generic.shortest_path
Ultimately you'll have to decide how you want to weight the distance between two companies (vertices) in your graph.
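As an illustration of that idea, a small sketch that turns pairwise shortest-path distances from a NetworkX graph into a SciPy dendrogram (the karate-club graph is a stand-in; on anything close to the full dataset you would first have to reduce the graph drastically):

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

G = nx.karate_club_graph()        # stand-in for a heavily reduced company graph
nodes = list(G)

# Pairwise shortest-path lengths as the "distance" between companies.
paths = dict(nx.all_pairs_shortest_path_length(G))
D = np.array([[paths[u][v] for v in nodes] for u in nodes], dtype=float)

# Condensed distance matrix -> hierarchical clustering -> dendrogram.
Z = hierarchy.linkage(squareform(D), method="average")
hierarchy.dendrogram(Z, labels=nodes)
plt.show()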
You have too many data points, and if you did visualize the network it wouldn't make any sense. You need ways to 1) reduce the number of companies by removing those that are less important/less connected, and 2) summarize the graph somehow and then visualize it.
To reduce the size of the data, it might be better to create the network independently (using your own code to create an edge list of companies; see the sketch below). This way you can reduce the size of your graph (for example by removing singletons, of which there may be many).
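A rough sketch of that edge-list construction, assuming the file looks like the sample in the question (one company name and one identifier per line; the file name is a placeholder). Two companies get an edge when they appear with the same identifier, and isolated companies are dropped afterwards:

from collections import defaultdict
from itertools import combinations

import networkx as nx

# Group companies by the value in the second column of the file.
groups = defaultdict(set)
with open("companies.txt") as fh:          # file name is an assumption
    for line in fh:
        company, value = line.split()
        groups[value].add(company)

# Two companies share an edge when they appear with the same value.
# (Beware: a very popular value produces quadratically many edges.)
G = nx.Graph()
for members in groups.values():
    G.add_nodes_from(members)
    G.add_edges_from(combinations(sorted(members), 2))

# Drop singletons, i.e. companies that never share a value with anyone.
G.remove_nodes_from(list(nx.isolates(G)))
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")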
For summarization I recommend running a clustering or a community detection algorithm. This can be done very fast even for very large networks. Use the "fastgreedy" method in the igraph package: http://igraph.sourceforge.net/doc/R/fastgreedy.community.html
(There is also a faster algorithm available online, by Blondel et al.: http://perso.uclouvain.be/vincent.blondel/publications/08BG.pdf; I know their code is available somewhere online.)
