Cutoff in Closeness/Betweenness Centrality in python igraph - python

I am currently working with a large graph, with 1.5 million nodes and 11 million edges.
For the sake of speed, I checked the benchmarks of the most popular graph libraries: igraph, graph-tool, NetworkX and NetworKit. It seems igraph, graph-tool and NetworKit have similar performance, and I eventually settled on igraph.
With the directed graph built in igraph, the PageRank of all vertices can be calculated in 5 seconds. However, when it came to betweenness and closeness, the calculation took forever.
The documentation says that by specifying a cutoff, igraph will ignore all paths longer than the cutoff value.
I am wondering: is there a rule of thumb for choosing a good cutoff value?

The cutoff really depends on the application and on the network parameters (number of nodes, number of edges).
It is hard to give a universal closeness threshold, since it depends greatly on those same parameters.
One thing you can know for sure is that every closeness centrality lies somewhere between 2/[n(n-1)] (the minimum, attained at the endpoint of a path graph) and 1/(n-1) (the maximum, attained at the centre of a star or at any vertex of a clique).
Perhaps a better question would be about the Freeman centralization of closeness, which is a normalized version of closeness that is easier to compare across different graphs.
Suggestion:
You can do a grid search over different cutoff values and then choose the one that makes the most sense for your application.
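For example, one pragmatic way to run such a search is to compute scores for a small sample of vertices at increasing cutoffs and stop when the ranking stops changing. A rough sketch with python-igraph (the sample size and cutoff grid below are arbitrary assumptions, and g is assumed to be your already-built Graph):

import random
from scipy.stats import spearmanr

sample = random.sample(range(g.vcount()), 1000)   # score only a vertex sample for speed
previous = None
for cutoff in (2, 3, 4, 5, 6):                    # arbitrary grid of cutoffs
    scores = g.betweenness(vertices=sample, cutoff=cutoff)
    if previous is not None:
        rho, _ = spearmanr(previous, scores)      # ranking stability vs the previous cutoff
        print("cutoff", cutoff, "Spearman rho vs previous:", round(rho, 3))
    previous = scores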

Related

Find all s-t node cuts in directed (or undirected) graph

I need to find all the possible node cuts in an s-t graph.
It is a directed graph, but finding the cuts for the undirected graph could be enough (I can filter them afterwards).
NetworkX provides functions such as all_node_cuts, but it is not implemented for directed graphs (no problem) and it does not consider s-t graphs, so the solution is not useful in my case (at least because the function also returns cut sets that include the source s).
I tried to implement the combinatorial approach (checking all possible combinations of nodes) and it works, but it is extremely inefficient: starting from graphs with more than 10 nodes the execution time becomes too large.
I tried to check whether it is possible to modify NetworkX functions like minimum_st_node_cut, but I did not succeed, and I don't know if it is possible to list all the possible cuts that way.
I could also use any other library (or even another programming language, if needed) if it provides some useful tool for this.
igraph provides an algorithm, but I cannot tell whether it is computationally efficient. For my instance of 93 nodes, it could not find a cut before I stopped it after a few hours. Nevertheless, it works for smaller instances:
import numpy as np
from igraph import Graph

# Create adjacency matrix
adjacency = np.array([[0, 0, 0],
                      [0, 0, 1],
                      [1, 0, 0]])

iG = Graph.Adjacency((adjacency > 0).tolist())
cuts = iG.all_st_cuts(1, 2)
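For what it's worth, here is a rough sketch of how the result can be inspected (attribute names are those documented for python-igraph's Cut objects; note that all_st_cuts enumerates edge cuts between the given source and target vertex indices):

for c in cuts:
    print("cut value:", c.value)           # number of edges in the cut
    print("edge ids:", c.cut)              # the edges whose removal separates source from target
    print("vertex partition:", c.partition)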

Make cyclic directed graph acyclic (need explanations)

I am looking for a way to make a directed graph acyclic. I have read about the minimum feedback arc set problem and a related post, but I don't understand the solutions well enough to implement them.
My goal is to acyclic-ize several graphs, each one having very few nodes (usually fewer than 50), with low connectivity, but sometimes enough for the graph to be cyclic.
I do have weights on my edges, but I would prefer to minimise the connectivity loss rather than the weight loss. I cannot edit the weight values, but I can reverse edge directions.
I am aware that this is not a simple task, so any detailed explanation (and/or code or pseudo-code) would help a lot.
Note: for my current project, I am using Python 3.7 and the networkx package.
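One naive way to guarantee acyclicity while only reversing edges is to fix some ordering of the nodes and reverse every edge that points backwards in it; every edge then goes forward in the ordering, so no cycle can remain. A minimal networkx sketch (the ordering below is arbitrary, not a feedback-arc-set heuristic, so it may reverse more edges than necessary):

import networkx as nx

def make_acyclic(G):
    order = {node: i for i, node in enumerate(G.nodes())}   # arbitrary node ordering
    H = nx.DiGraph()
    H.add_nodes_from(G.nodes(data=True))
    for u, v, data in G.edges(data=True):
        if u == v:
            continue                      # drop self-loops, which can never be made acyclic
        if order[u] < order[v]:
            H.add_edge(u, v, **data)      # already points forward in the ordering
        else:
            H.add_edge(v, u, **data)      # reverse the backward edge, keeping its weight
    assert nx.is_directed_acyclic_graph(H)
    return H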

Community Detection Algorithms using NetworkX

I have a graph network, the Email-Eu network, which is available here.
The dataset is a graph of around 1,005 nodes, with the edges that form this giant graph. It also has ground-truth labels giving each node's community (department); each node belongs to one of 42 departments.
I want to run a community detection algorithm on the graph to find the corresponding department for each node. My main objective is to find the nodes in the largest community.
So first I need to find the 42 departments (communities), then find the nodes in the biggest one of them.
I started with the Girvan-Newman algorithm to find the communities. The beauty of Girvan-Newman is that it is easy to implement: at each step I find the edge with the highest betweenness and remove it, until I reach the 42 departments (communities) I want.
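A minimal sketch of that iterative removal, using networkx's girvan_newman generator and assuming the Email-Eu graph is already loaded as G:

from networkx.algorithms.community import girvan_newman

communities = None
for communities in girvan_newman(G):      # each iteration removes the highest-betweenness edge(s)
    if len(communities) >= 42:
        break
largest = max(communities, key=len)       # the nodes in the biggest community
print(len(communities), "communities; the largest has", len(largest), "nodes")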
I am struggling to find other community detection algorithms that give me the option of specifying how many communities/partitions to break my graph into.
Is there any community detection function/technique that gives me the option of specifying how many communities I need to uncover from my graph? Any ideas are very much appreciated.
I am using Python and NetworkX.
A (very) partial answer (and solution) to your question is to use the Fluid Communities algorithm, implemented in NetworkX as asyn_fluidc.
Note that it works on connected, undirected, unweighted graphs, so if your graph has n connected components, you should run it n times. In fact this could be a significant issue, as you would need some preliminary knowledge of each component to choose the corresponding k.
Anyway, it is worth a try.
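A minimal sketch of what that could look like, assuming the Email-Eu graph is loaded as an undirected networkx graph G (here run only on the largest connected component):

import networkx as nx
from networkx.algorithms.community import asyn_fluidc

giant = G.subgraph(max(nx.connected_components(G), key=len)).copy()
communities = list(asyn_fluidc(giant, k=42, seed=0))   # ask for exactly 42 communities
largest = max(communities, key=len)
print("largest community has", len(largest), "nodes")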
You may want to try pysbm. It is based on networkx and implements different variants of stochastic block models and inference methods.
If you consider switching from networkx to a different Python-based graph package, you may want to look at graph-tool, where you would be able to use the stochastic block model for the clustering task. Another noteworthy package is igraph; you may want to look at How to cluster a graph using python igraph.
The approaches directly available in networkx are rather old-fashioned. If you aim for state-of-the-art clustering methods, you may consider spectral clustering or Infomap. The selection depends on your intended use of the inferred communities. The task of inferring ground truth from a network falls (approximately) under the No-Free-Lunch theorem: roughly, no algorithm exists that returns "better" communities than any other algorithm if we average the results over all possible inputs.
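As an illustration of a method where the number of communities is fixed up front, here is a rough spectral-clustering sketch with scikit-learn (assuming the graph is loaded as an undirected networkx graph G; treating the adjacency matrix directly as the affinity is a simplifying assumption):

import networkx as nx
from sklearn.cluster import SpectralClustering

A = nx.to_numpy_array(G)                  # dense adjacency matrix, fine for ~1000 nodes
labels = SpectralClustering(n_clusters=42,
                            affinity="precomputed",
                            assign_labels="discretize",
                            random_state=0).fit_predict(A)
# labels[i] is the community index (0..41) of the i-th node in G.nodes() order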
I am not entirely sure of my answer, but maybe you can try this. Are you aware of label propagation? The main idea is that you have some nodes in the graph which are labelled, i.e. they belong to a community, and you want to give labels to the other, unlabelled nodes in your graph. LPA will spread these labels across the graph and give you a list of nodes and the communities they belong to. These communities will be the same ones that your labelled set of nodes belongs to.
So I think you can control the number of communities you want to extract from the graph by controlling the number of communities you initialise at the beginning. But I think it is also possible that, after LPA converges, some of the communities you initialised vanish from the graph due to the graph structure and also the randomness of the algorithm. There are many variants of LPA where you can control this randomness. I believe this page of sklearn talks about it.
You can read about LPA here and also here.
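To make the idea concrete, here is a very rough, library-free sketch of seeded label propagation on a networkx graph; seeds (a dict mapping a few nodes to community labels) and the function name are hypothetical, not from any particular LPA implementation:

from collections import Counter

def seeded_lpa(G, seeds, max_iter=100):
    labels = dict(seeds)                                   # seed nodes keep their labels
    for _ in range(max_iter):
        changed = False
        for node in G.nodes():
            if node in seeds:
                continue
            neighbour_labels = [labels[n] for n in G.neighbors(node) if n in labels]
            if not neighbour_labels:
                continue
            best = Counter(neighbour_labels).most_common(1)[0][0]
            if labels.get(node) != best:
                labels[node] = best                        # adopt the most common neighbour label
                changed = True
        if not changed:
            break
    return labels                                          # node -> community label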

NetworkX: Approximate/Inexact Subgraph Isomorphism For Undirected Weighted Graphs

Given two graphs (A and B), I am trying to determine if there exists a subgraph of B that matches A within some threshold based on the difference in edge weights. That is, if I take the sum of the differences between each pair of associated edge weights, it should be below a specified threshold. The vertex labels are not consistent between A and B, so I am relying only on the edge weights.
A will be somewhat small (e.g. at most 10 nodes) and B will be larger (e.g. at most 200 nodes).
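To make the threshold concrete, here is a small sketch of the check for one candidate mapping from nodes of A to nodes of B (how such a mapping is found is the hard part and is not shown; mapping, threshold and the function name are just illustrative):

def weight_difference(A, B, mapping):
    total = 0.0
    for u, v, data in A.edges(data=True):
        w_a = data.get("weight", 1.0)
        mu, mv = mapping[u], mapping[v]
        if not B.has_edge(mu, mv):
            return float("inf")                   # a mapped edge is missing in B: no match
        total += abs(w_a - B[mu][mv].get("weight", 1.0))
    return total

# accept the mapping if weight_difference(A, B, mapping) <= threshold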
I believe one of these two packages may help:
The Graph Matching Toolbox in MATLAB "implements spectral graph matching with affine constraint (SMAC), optionally with kronecker bistochastic normalization". It states on the webpage that it "handles graphs of different sizes (subgraph matching)"
http://www.timotheecour.com/software/graph_matching/graph_matching.html
The algorithm used in the Graph Matching Toolbox in MATLAB is based on the paper by Timothee Cour, Praveen Srinivasan, and Jianbo Shi titled Balanced Graph Matching, published in NIPS 2006.
In addition, there is a second toolkit called the Graph Matching Toolkit (GMT) that seems like it might support error-tolerant subgraph matching, as it does support error-tolerant graph matching. Rather than using a spectral method, it offers various ways of computing edit distance, and my impression is that it then picks the matching that attains the minimum edit distance. If it doesn't explicitly support subgraph matching and you don't care about efficiency, you might just enumerate subgraphs of B and use GMT to try to match those subgraphs against A. Or maybe you could search only a subset of the subgraphs of B.
http://www.fhnw.ch/wirtschaft/iwi/gmt
Unfortunately, neither of these appears to be in Python, and they don't seem to support networkx's graph format either. But I believe you may be able to find a converter that will change the representation of the networkx graphs into something usable by these toolkits. Then you can run the toolkits and output your desired subgraph matchings.
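For instance, one possible hand-off is to dump the weighted networkx graphs to GraphML or a plain edge list and convert from there; whether these particular toolkits read either format directly is an assumption you would need to verify:

import networkx as nx

nx.write_graphml(A, "graph_A.graphml")             # keeps edge weight attributes
nx.write_graphml(B, "graph_B.graphml")
nx.write_weighted_edgelist(B, "graph_B.edgelist")  # simpler text format, if easier to convert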

Networkx spring layout edge weights

I was wondering how spring_layout takes edge weight into account. From Wikipedia:
'An alternative model considers a spring-like force for every pair of nodes (i, j), where the ideal length δ_ij of each spring is proportional to the graph-theoretic distance between nodes i and j, without using a separate repulsive force. Minimizing the difference (usually the squared difference) between Euclidean and ideal distances between nodes is then equivalent to a metric multidimensional scaling problem.'
How is edge weight factored in, specifically?
This isn't a great answer, but it gives the basics. Someone else may come by who actually knows the Fruchterman-Reingold algorithm and can describe it. I'm giving an explanation based on what I can find in the code.
From the documentation,
weight : string or None optional (default=’weight’)
The edge attribute that holds the numerical value used for the edge weight. If None, then all edge weights are 1.
But that doesn't tell you what it does with the weight, which is your question.
You can find the source code. If you send in weighted edges, it will create an adjacency matrix A with those weights and pass A to _fruchterman_reingold.
Looking at the code there, the meat of it is in this line:
displacement = np.transpose(np.transpose(delta) *
                            (k * k / distance**2 - A * distance / k)).sum(axis=1)
The A*distance term calculates how strong a spring force is acting on the node. A larger value in the corresponding entry of A means a relatively stronger attractive force between those two nodes (or, if they are very close together, a weaker repulsive force). The algorithm then moves the nodes according to the direction and strength of these forces, and repeats (50 times by default). Interestingly, if you look at the source code you'll also notice a t and a dt: at each iteration the step is scaled down by a shrinking factor, so the moves get smaller and smaller.
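In practice, this means that passing a weight attribute makes heavier edges pull their endpoints closer together. A tiny sketch:

import networkx as nx

G = nx.Graph()
G.add_edge("a", "b", weight=1.0)
G.add_edge("b", "c", weight=5.0)   # heavier edge: b and c end up relatively closer
pos = nx.spring_layout(G, weight="weight", iterations=50, seed=0)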
Here is a link to the paper describing the algorithm, which unfortunately is behind a paywall. Here is a link to the paper on the author's webpage.
