Make cyclic directed graph acyclic (need explanations) - python

I am looking for a way to make a directed graph acyclic. I have read about [Minimum Feedback Arc Set] and [this post], but I don't understand the solutions enough to implement them.
My goal is to make several graphs acyclic; each one has very few nodes (usually fewer than 50) and low connectivity, but sometimes enough connectivity for the graph to be cyclic.
I do have weights on my edges, but I would prefer to minimise the connectivity loss rather than the weight loss. I cannot edit the weight values, but I can reverse edge directions.
I am aware that this is not a simple task, so any detailed explanation (and/or code or pseudo-code) would help a lot.
Note: for my current project, I am using Python 3.7 and the networkx package.
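For reference, one standard trick for this: fix a linear order on the nodes and reverse every edge that points "backwards" in that order. The result is guaranteed to be acyclic, since every edge then goes forward in the order, and a good ordering keeps the number of reversals small (finding the optimal order is exactly the NP-hard minimum feedback arc set problem). Below is a minimal networkx sketch; the degree-based ordering is a crude stand-in for the Eades-Lin-Smyth greedy heuristic:

```python
import networkx as nx

def make_acyclic(g):
    """Return a copy of g with every back edge reversed so the result is a DAG."""
    # Source-like nodes (more out- than in-edges) go first; any order works,
    # a better one just reverses fewer edges.
    ranked = sorted(g.nodes(), key=lambda n: g.in_degree(n) - g.out_degree(n))
    position = {node: i for i, node in enumerate(ranked)}

    h = nx.DiGraph()
    h.add_nodes_from(g.nodes(data=True))
    for u, v, data in g.edges(data=True):
        if position[u] < position[v]:
            h.add_edge(u, v, **data)
        else:
            h.add_edge(v, u, **data)  # reverse the back edge, keeping its weight
            # note: an antiparallel pair (u->v and v->u) collapses into one edge

    h.remove_edges_from(list(nx.selfloop_edges(h)))  # self-loops cannot be fixed
    return h
```

After this, `nx.is_directed_acyclic_graph(make_acyclic(g))` should be True for any input DiGraph.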

Related

Algorithms for constrained clustering on attributed graphs with some cluster-level constraints on their attributes

I have a graph with 240k nodes and 550k edges with five attributes per node coming out of an autoencoder from a sparse dataset. I'm looking to partition the graph into n clusters, such that intra-partition attribute similarity is maximized, the partitions are connected, and the sum of one of the attributes doesn't exceed a threshold for any given cluster.
I've tried poking around with an autoencoder but had issues making a loss function that would get the results I needed. I've also looked at hierarchical clustering with connectivity constraints but can't find a way to enforce my sum constraint optimally. Same issue with community detection algorithms on graphs, like Louvain.
If anyone knows of any approaches to solving this I'd love to hear it, ideally something implemented in Python already but I can probably implement whatever algorithm I need should it not be. Thanks!
First of all, the problem is most likely NP-hard, so the best you can do is some greedy optimization. It will definitely help to first break the graph into subsets that cannot be connected ever (remove links of nodes that are not similar enough, then compute the connected components). Then for each component (which hopefully are much smaller than 250k, otherwise tough luck!) run a classic optimizer that allows you to specify the cost function. It is probably a good idea to use an integer linear program, and consider the Lagrange dual version of the problem.
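As a rough sketch of that pre-processing step (the attribute name, the cosine-similarity measure and the 0.8 threshold are all placeholder assumptions, and the graph is assumed undirected):

```python
import networkx as nx
import numpy as np

def split_by_similarity(g, attr="features", min_similarity=0.8):
    """Drop edges between dissimilar nodes, then return the resulting
    connected components as independent subgraphs."""
    h = g.copy()
    too_different = []
    for u, v in h.edges():
        a = np.asarray(h.nodes[u][attr], dtype=float)
        b = np.asarray(h.nodes[v][attr], dtype=float)
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if cos < min_similarity:
            too_different.append((u, v))
    h.remove_edges_from(too_different)
    return [h.subgraph(c).copy() for c in nx.connected_components(h)]
```

Each component can then be handed to the ILP or greedy optimizer independently.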

Community Detection Algorithms using NetworkX

I have a graph: the Email-Eu network, which is available here.
This dataset contains the actual graph of around 1005 nodes, with the edges that form this giant graph. It also has the ground-truth community (department) label for each node: each node belongs to one of 42 departments.
I want to run a community detection algorithm on the graph to find the corresponding department for each node. My main objective is to find the nodes in the largest community.
So, first I need to find the 42 departments (communities), then find the nodes in the biggest one of them.
I started with the Girvan-Newman algorithm to find the communities. The beauty of Girvan-Newman is that it is easy to implement: each time, I find the edge with the highest betweenness and remove it, until I reach the 42 departments (communities) I want.
I am struggling to find other Community Detection Algorithms that give me the option of specifying how many communities/partitions I need to break down my graph into.
Is there any community detection function/technique that gives me the option of specifying how many communities I need to uncover from my graph? Any ideas are very much appreciated.
I am using Python and NetworkX.
A (very) partial answer (and solution) to your question is to use the Fluid Communities algorithm, implemented in NetworkX as asyn_fluidc.
Note that it works on connected, undirected, unweighted graphs, so if your graph has n connected components, you should run it n times. In fact this could be a significant issue as you should have some sort of preliminary knowledge of each component to choose the corresponding k.
Anyway, it is worth a try.
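For illustration, a minimal usage sketch (the karate-club graph is just a placeholder; for the full Email-Eu network you would use k = 42, applied per connected component as noted above):

```python
import networkx as nx
from networkx.algorithms.community import asyn_fluidc

g = nx.karate_club_graph()   # placeholder for your graph
k = 2                        # desired number of communities
communities = list(asyn_fluidc(g, k, seed=42))  # list of node sets
largest = max(communities, key=len)
print(len(communities), len(largest))
```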
You may want to try pysbm. It is based on networkx and implements different variants of stochastic block models and inference methods.
If you consider switching from networkx to a different Python-based graph package, you may want to consider graph-tool, where you would be able to use the stochastic block model for the clustering task. Another noteworthy package is igraph; you may want to look at How to cluster a graph using python igraph.
The approaches directly available in networkx are rather old-fashioned. If you aim for state-of-the-art clustering methods, you may consider spectral clustering or Infomap. The selection depends on your desired usage of the inferred communities. The task of inferring ground truth from a network falls under the (approximate) No-Free-Lunch theorem, i.e. (roughly) no algorithm exists that returns "better" communities than any other algorithm, if we average the results over all possibilities.
I am not entirely sure of my answer but maybe you can try this. Are you aware of label propagation? The main idea is that you have some nodes in the graph which are labelled, i.e. they belong to a community, and you want to give labels to the other, unlabelled nodes in your graph. LPA will spread these labels across the graph and give you a list of nodes and the communities they belong to. These communities will be the same as the ones that your labelled set of nodes belong to.
So I think you can control the number of communities you want to extract from the graph by controlling the number of communities you initialise at the beginning. But I think it is also possible that after LPA converges, some of the communities you initialised vanish from the graph due to the graph structure and the randomness of the algorithm. There are many variants of LPA where you can control this randomness, though. I believe this page of sklearn talks about it.
You can read about LPA here and also here
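To make the idea concrete, here is a minimal hand-rolled sketch of seeded label propagation on a networkx graph. This illustrates the principle only; it is not sklearn's or NetworkX's own implementation, and the `seeds` mapping (a few nodes assigned to community ids) is an assumption of the sketch:

```python
import random
import networkx as nx

def seeded_lpa(g, seeds, max_iter=100, rng=None):
    """Spread the community ids in `seeds` (node -> id) over the whole graph
    by iterated majority vote among already-labelled neighbours."""
    rng = rng or random.Random(0)
    labels = dict(seeds)
    nodes = list(g.nodes())
    for _ in range(max_iter):
        rng.shuffle(nodes)
        changed = False
        for n in nodes:
            if n in seeds:                   # seed labels stay fixed
                continue
            counts = {}
            for nb in g.neighbors(n):
                if nb in labels:
                    counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            if counts:
                best = max(counts, key=counts.get)
                if labels.get(n) != best:
                    labels[n] = best
                    changed = True
        if not changed:                      # converged
            break
    return labels

# e.g. two seed nodes define two communities on the karate-club graph
g = nx.karate_club_graph()
print(seeded_lpa(g, {0: "a", 33: "b"}))
```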

Cutoff in Closeness/Betweenness Centrality in python igraph

I am currently working with a large graph with 1.5 million nodes and 11 million edges.
For the sake of speed, I checked the benchmarks of the most popular graph libraries: iGraph, Graph-tool, NetworkX and Networkit. iGraph, Graph-tool and Networkit seem to have similar performance, and I eventually settled on iGraph.
With the directed graph built in iGraph, the PageRank of all vertices can be calculated in 5 seconds. However, when it came to betweenness and closeness, the calculation took forever.
In the documentation, it says that by specifying a cutoff, iGraph will ignore all paths longer than the cutoff value.
I am wondering: is there a rule of thumb for choosing the best cutoff value?
The cutoff really depends on the application and on the network parameters (# nodes, # edges).
It's hard to give a universal closeness threshold, since it depends greatly on other parameters (# nodes, # edges, ...).
One thing you can know for sure is that every closeness centrality lies between 2/[n(n-1)] (the minimum, attained at the end of a path graph) and 1/(n-1) (the maximum, attained at a clique or at the centre of a star).
Perhaps a better question would be about the Freeman centralization of closeness (a normalized version of closeness that is easier to compare across different graphs).
Suggestion:
You can do a grid search over different cutoff values and then choose the one that makes the most sense for your application.
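A minimal sketch of that grid search with python-igraph. The random graph and the error measure are placeholder assumptions; on a 1.5M-node graph you would compute the exact baseline only on a small subsample or subgraph:

```python
import igraph as ig

g = ig.Graph.Erdos_Renyi(n=1000, m=5000, directed=True)  # placeholder graph

exact = g.betweenness()  # feasible here; use a subgraph for huge networks
for cutoff in (2, 3, 4, 5, 6):
    approx = g.betweenness(cutoff=cutoff)
    # mean absolute deviation from the exact scores
    err = sum(abs(a - e) for a, e in zip(approx, exact)) / g.vcount()
    print(f"cutoff={cutoff}: mean abs error {err:.2f}")
```

Pick the smallest cutoff after which the error (or the ranking you care about) stops changing much.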

What's the right algorithm for finding isolated subsets

A picture is worth a thousand words, so:
My input is the matrix on the left, and what I need to find is the sets of nodes that are at most one step away from each other (not diagonally). A node that is more than one up/down/left/right step away would be in a separate set.
So, my plan was to run a BFS from every node I find, return the set it traversed, and remove that set from the original one, iterating this process until I'm done. But then I had the wild idea of looking for graph analysis tools, and I found NetworkX. Is there an easy way (algorithm?) to achieve this without manually writing a BFS and traversing the whole matrix?
Thanks
What you are trying to do is search for "connected components", and NetworkX has a method for doing exactly that, as can be seen in the first example on this documentation page and as others have already pointed out in the comments.
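For example, you can build the grid graph from the matrix and let NetworkX do the rest (the mask below is a placeholder for your input matrix):

```python
import numpy as np
import networkx as nx

mask = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 1]])   # placeholder input

g = nx.Graph()
rows, cols = mask.shape
for r in range(rows):
    for c in range(cols):
        if mask[r, c]:
            g.add_node((r, c))
            if r and mask[r - 1, c]:           # connect to the cell above
                g.add_edge((r, c), (r - 1, c))
            if c and mask[r, c - 1]:           # connect to the cell on the left
                g.add_edge((r, c), (r, c - 1))

print(list(nx.connected_components(g)))  # [{(0, 0), (0, 1), (1, 1)}, {(2, 2)}]
```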
Reading your question, it seems that your nodes are on a discrete grid and that the concept of "connected" you describe is the same one used for the pixels of an image.
Connected components algorithms are available for graphs and for images also.
If performance is important in your case, I would suggest going for the image version of connected components.
This comes from the fact that images (grids of pixels) are a specific class of graphs, so connected-components algorithms for grids of nodes are built knowing the topology of the graph itself (i.e. the graph is planar and the maximum vertex degree is four). A general algorithm has to work on arbitrary graphs (they may be non-planar, with multiple edges between some nodes), so it has to spend more work because it cannot assume much about the properties of the input graph.
Since connected components can be found in linear time on graphs, I am not saying the image version would be orders of magnitude faster; there will only be a constant factor between the two.
For this reason, you should also take into account the data structure that holds your input data and how much time will be spent creating the input structures required by each version of the algorithm.
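For comparison, the image version is essentially a one-liner with scipy.ndimage.label, whose default structuring element is exactly the 4-connectivity (no diagonals) described in the question:

```python
import numpy as np
from scipy import ndimage

mask = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 1],
                 [0, 0, 0, 1],
                 [1, 0, 0, 1]])   # placeholder input

labeled, n_sets = ndimage.label(mask)  # default structure = 4-connectivity
print(n_sets)    # 3 isolated subsets
print(labeled)   # each cell tagged with its component id (0 = background)
```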

Algorithm of community_edge_betweenness() in python-igraph implementation

I had to shift from community_fastgreedy() to community_edge_betweenness() due to the inability of community_fastgreedy() to handle directed graphs (my graph is directed and unweighted).
My understanding is that community_fastgreedy() is bottom-up approach while community_edge_betweenness() is top-down and both work on the principle of finding communities that maximize modularity, one by merging communities and the other by removing edges.
In the original paper by M. Girvan and M. E. J. Newman, "Community structure in social and biological networks", there is no mention of the method being able to handle directed graphs. This is the paper behind community_edge_betweenness().
I referred here and to the linked documentation to get more information on the algorithm for directed networks.
My questions are:
1. Is my understanding correct that the community_fastgreedy() and community_edge_betweenness() implementations in python-igraph depend on maximizing modularity?
2. Can you point me to documentation of how community_edge_betweenness() is implemented to handle directed networks in python-igraph, or to a newer version of the paper by Girvan and Newman?
Since I am new to community detection, any pointers are useful. I am aware of better methods (Louvain, Infomap) but still need to use CNM or GN for comparison purposes.
Thanks.
community_edge_betweenness() does not try to maximize modularity. Modularity is only used as a rule of thumb to decide where to "cut" the dendrogram generated by the algorithm if the user insists on a "flat" community structure instead of a flat dendrogram.
community_edge_betweenness() "handles" directed graphs simply by looking for directed paths instead of undirected ones when it calculates the edge betweenness scores for the edges (which are then used in turn to decide which edge to remove at a particular step). As far as I know, no research has been made on whether this approach is scientifically sound and correct or not.
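In code, that dendrogram cut looks roughly like this in python-igraph (the random graph is just a placeholder; as_clustering() cuts at the modularity peak by default, or into a fixed number of groups if you pass n):

```python
import igraph as ig

g = ig.Graph.Erdos_Renyi(n=60, m=240, directed=True)  # placeholder graph

# Returns a VertexDendrogram; betweenness is computed over directed paths
dendrogram = g.community_edge_betweenness(directed=True)

by_modularity = dendrogram.as_clustering()   # cut chosen by the modularity heuristic
forced_flat = dendrogram.as_clustering(n=4)  # force a flat cut into 4 communities
print(len(by_modularity), len(forced_flat))
```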
The reason why most community detection methods (especially the ones that are maximizing modularity) do not cater for directed graphs is because the concept of a "community" is not well-defined for directed graphs - most of the algorithms look for parts in the graph that are "denser than expected by chance", but this vague definition does not say anything about how the directions of edges should be used. Also, there are multiple (conflicting) extensions of the modularity score for directed graphs.
As far as I know, the only method in igraph that has a "formal" treatment of the problem of communities in directed networks is the InfoMap algorithm. InfoMap defines communities based on minimal encodings of random walks within graphs, so it is able to take the edge directions into account accurately - roughly speaking, communities found by the InfoMap algorithm are groups of nodes for which a random walker has a small probability of "escaping" from the group. (The InfoMap homepage has a nice visual explanation.) So, if you really need to find communities in a directed graph, I would suggest using the InfoMap method.
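A usage sketch (again with a placeholder random graph standing in for a real directed network):

```python
import igraph as ig

g = ig.Graph.Erdos_Renyi(n=200, m=800, directed=True)  # placeholder graph

clusters = g.community_infomap(trials=10)  # takes edge directions into account
print(len(clusters))             # number of communities found
print(clusters.membership[:10])  # community id of the first ten vertices
```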
