Python: Create a graph with defined number of edges per node

How can I create a graph with
- a predefined number of connections for each node, say 3
- a given distribution of connections (say a Poisson distribution with a given mean)
Thanks

If you are using NetworkX you might try the "configuration model".
This was discussed in the SO question Generating a graph with certain degree distribution?
In graph theory terminology the number of connections of a node is called its "degree", and graphs with uniform degree (all nodes the same) are called "regular graphs".
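As a minimal sketch of both cases, assuming NetworkX and NumPy are installed (the node count, the mean of 3 and the seeds are arbitrary illustration values, not part of the question):

```python
import networkx as nx
import numpy as np

n = 20  # number of nodes (arbitrary example value)

# Case 1: every node has exactly 3 connections (a 3-regular graph).
# n * 3 must be even for such a graph to exist.
regular = nx.random_regular_graph(3, n, seed=1)

# Case 2: degrees drawn from a Poisson distribution with mean 3.
rng = np.random.default_rng(1)
degrees = [int(d) for d in rng.poisson(3, size=n)]
if sum(degrees) % 2:        # the degree sum must be even
    degrees[0] += 1
multigraph = nx.configuration_model(degrees, seed=1)

# The configuration model may contain parallel edges and self-loops;
# removing them gives a simple graph but can lower some degrees.
simple = nx.Graph(multigraph)
simple.remove_edges_from(list(nx.selfloop_edges(simple)))
```

The regular graph matches the requested degree exactly; the cleaned-up configuration model only approximates the sampled degree sequence.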

First you have to define a graph data type:
class Graph:
    def __init__(self):
        self.related_nodes = set()
Then you define a factory function for this data structure that does what you want. For example:
def build_n_edge_graph(n):
    # one Graph object per node
    nodes = [Graph() for _ in range(n)]
    for i, node in enumerate(nodes):
        for j in range(n):
            if i != j:
                # link every node to every other node
                node.related_nodes.add(nodes[j])
    return nodes
(untested!)
Or some other algorithm.

It seems to me you should:
- decide how many nodes you will have
- generate the number of links per node from your desired distribution, making sure the sum is even
- start randomly connecting pairs of nodes until all link requirements are satisfied
There are a few more constraints: no pair of nodes should be connected more than once, no node should have more than (number of nodes - 1) links, and maybe you want to ensure the graph is connected. But basically that's it.
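The steps above can be sketched in plain Python. This is only one possible reading of "randomly connecting pairs": it shuffles a list of link "stubs" and retries whenever a self-link or repeated pair shows up. The helper names `poisson_sample` and `connect_randomly`, the retry limit, and the way the degree sum is made even are assumptions for illustration.

```python
import math
import random

def poisson_sample(mean, rng):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def connect_randomly(degrees, rng, max_tries=1000):
    """Return a set of edges (a, b), a < b, realising the given degrees.

    Every node appears in a list of "stubs" once per required link; the list
    is shuffled and consecutive stubs are paired. A shuffle that produces a
    self-link or a repeated pair is discarded and retried.
    """
    if sum(degrees) % 2:
        raise ValueError("the sum of the degrees must be even")
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    for _ in range(max_tries):
        rng.shuffle(stubs)
        edges = set()
        for a, b in zip(stubs[::2], stubs[1::2]):
            edge = (min(a, b), max(a, b))
            if a == b or edge in edges:
                break              # invalid pairing, reshuffle
            edges.add(edge)
        else:
            return edges           # every pair was valid
    raise RuntimeError("no valid pairing found; try other degrees")

rng = random.Random(0)
n_nodes = 20
regular_edges = connect_randomly([3] * n_nodes, rng)    # 3 links per node

# Poisson-distributed degrees, capped at n_nodes - 1, with an even sum
degrees = [min(poisson_sample(3.0, rng), n_nodes - 1) for _ in range(n_nodes)]
if sum(degrees) % 2:
    degrees[degrees.index(max(degrees))] -= 1
poisson_edges = connect_randomly(degrees, rng)
```

This sketch does not check that the result is connected; that constraint would need a separate test or a different construction.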

Related

How to get a random component subgraph with given number of edges from the existing component graph?

I would like to build a connected ("component") subgraph from the edges of an existing connected graph in networkx. Both graphs are undirected.
For example, I want my new graph to have 100 edges taken from the existing one and to be connected. The existing one has about 2 million edges and is connected.
My current approach is below:
import random
import networkx as nx

def get_random_component(number_of_edges):
    # G is the existing (global) graph
    G_def = nx.Graph()   # subgraph being built
    G_copy = nx.Graph()  # edges of G not used yet
    G_iter = nx.Graph()  # candidate edges for the current step
    G_copy.add_edges_from(G.edges)
    for i in range(number_of_edges):
        G_iter.clear()
        G_iter.add_edges_from(G_copy.edges)
        currently_found_edge = random.choices(list(G_iter.edges), k=1)[0]
        # reject edges that do not touch the subgraph built so far
        # (the very first edge is always accepted)
        while (G_def.has_edge(*currently_found_edge) or (G_def.number_of_nodes() > 0
                and not G_def.has_node(currently_found_edge[0])
                and not G_def.has_node(currently_found_edge[1]))):
            G_iter.remove_edge(*currently_found_edge)
            currently_found_edge = random.choices(list(G_iter.edges), k=1)[0]
        G_def.add_edge(*currently_found_edge)
        G_copy.remove_edge(*currently_found_edge)
    return G_def
but it is very time-consuming. Is there a better way to find a random connected subgraph with a given number of edges?

Yes. First, when you're asking for algorithm help, post your algorithm in an easily-readable form. Code is fine, but only if you use meaningful variable names: def, copy, and iter don't mean much.
Your posted algorithm goes through a lot of failure pain in that while loop, especially with your given case of building a 100-edge component from a graph of 2e6 edges. Unless the graph is heavily connected, you will spin a lot for each new edge.
Instead of flailing through the graph, construct a connected subgraph. Let's call it SG. Also, assume that G below is a copy of the original graph that we can mutate as desired.
new_node = a random node of G.
legal_move = new_node.edges() # A set of edges you can traverse from nodes in SG
for _ in range(100):
    if legal_move is empty:
        # You have found a connected component
        # with fewer than 100 edges.
        # SG is the closure of that component.
        # Break out of this loop, subtract SG from G, and start over.
    new_edge = # a random edge from legal_move (see note below)
    subtract SG from legal_move (don't consider any edges already used).
    add new_edge to SG
    One node of new_edge is already in SG; if the other is not ...
        Add other node to SG
        Add other.edges to legal_move (except for new_edge)
Note on random choice:
You have to define your process for "random". One simple way is to
choose a random edge from legal_move. Another way is to choose a
random node from those in legal_move, and then choose a random edge
from that node. Your growth patterns will differ, depending on the
degree of each node.
The process above will be much faster for most graphs.
Since each edge references both of its nodes, and each node maintains a list of its edges, the exploration and update phases will be notably faster.
Coding is left as an exercise for the student. :-)
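For readers who want a starting point, here is one possible sketch of the pseudo-code above in plain Python. It assumes the graph is given as a dict mapping each node to the set of its neighbours (with networkx, something like `{n: set(G[n]) for n in G}` would produce that), and it uses the "random edge from the legal moves" variant. The function name `grow_connected_edges` is made up for this example.

```python
import random

def grow_connected_edges(adjacency, n_edges, rng):
    """Collect up to n_edges edges forming one connected subgraph.

    adjacency maps each node to the set of its neighbours (undirected graph).
    """
    start = rng.choice(list(adjacency))
    in_subgraph = {start}
    chosen = set()
    # legal moves: edges (u, v) whose first end u is already in the subgraph
    legal_moves = [(start, v) for v in adjacency[start]]
    while legal_moves and len(chosen) < n_edges:
        # pick a random legal move and remove it in O(1) by swapping to the end
        i = rng.randrange(len(legal_moves))
        legal_moves[i], legal_moves[-1] = legal_moves[-1], legal_moves[i]
        u, v = legal_moves.pop()
        edge = frozenset((u, v))
        if edge in chosen:
            continue                   # same edge reached from its other end
        chosen.add(edge)
        if v not in in_subgraph:
            in_subgraph.add(v)
            legal_moves.extend((v, w) for w in adjacency[v] if w != u)
    return chosen

# Example: a ring of 20 nodes; any connected 7-edge subgraph is a path.
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
edges = grow_connected_edges(ring, 7, random.Random(0))
touched = set().union(*edges)
```

If the function returns fewer edges than requested, the start node sat in a component that is too small, and you would start over from another node as the pseudo-code describes.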

OSMNX graph to distance matrix and DBSCAN

I'm having trouble computing the distance matrix from a graph of coordinates (LAT, LON).
I want to take an arbitrary group of points (let's say, 200,000 firms) and get their nearest nodes in the graph created with ox.graph_from_place().
I am working with dask arrays and dataframes (da.array, df.DataFrame)
if __name__ == "__main__":
    # OPTION B: Use a strongly (instead of weakly) connected graph
    Gs = ox.utils_graph.get_largest_component(G, strongly=True)
    Gs.__name__ = "Gs"
    # attach nearest network node to each firm
    df["nn"] = da.array(ox.get_nearest_nodes(Gs, X=df['longitude'], Y=df['latitude'], method='balltree') )
    # we'll get distances for each pair of nodes that have firms attached to them
    nodes_unique = pd.Series(df['nn'].unique())
    nodes_unique.index = nodes_unique.values
    # convert MultiDiGraph to DiGraph for simpler faster distance matrix computation
    G_dm = nx.DiGraph(Gs)
    G_dm.__name__ = "G_dm"
    save_pickle(Gs)
    save_pickle(G_dm)
    print("len df['nn']:", len(df['nn']))
    print("len nodes_unique:", len(nodes_unique))
Some code has been omitted to get to the heart of the matter. I then tried following https://networkx.org/documentation/stable/reference/algorithms/shortest_paths.html and a network_distance_matrix() function using the strongly connected graph Gs, but this is very inefficient in computation time. I have seen a bunch of functions in the docs, but none that efficiently computes a distance matrix between pairs of unique nodes belonging to the graph, following paths through it.
I would like to know if there is any way to parallelize this process, and/or to make it generative (lazy) so that it does not hold everything in RAM.
My objective is to provide a pre-computed matrix to a DBSCAN (sklearn.cluster) model, and it has to be quick so that I can "grid-search" through parameters. I am a beginner with these libraries.
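One direction that may be worth trying, sketched here under assumptions rather than as a tested solution for 200,000 firms: run one Dijkstra search per unique node with a distance cutoff, and yield the results lazily instead of filling a dense matrix. The function name `pairwise_distances`, the cutoff value and the tiny demo graph are hypothetical; `weight="length"` assumes the edge attribute that OSMnx graphs normally carry.

```python
import networkx as nx

def pairwise_distances(graph, nodes, cutoff, weight="length"):
    """Yield (source, target, distance) for node pairs within `cutoff`.

    One Dijkstra search is run per source node and stopped at `cutoff`, so
    distances are produced lazily and the full matrix is never held in memory.
    """
    wanted = set(nodes)
    for source in nodes:
        reachable = nx.single_source_dijkstra_path_length(
            graph, source, cutoff=cutoff, weight=weight
        )
        for target, distance in reachable.items():
            if target in wanted and target != source:
                yield source, target, distance

# Tiny stand-in for G_dm: a directed path 0 -> 1 -> 2 -> 3 with edge lengths.
G_demo = nx.DiGraph()
G_demo.add_weighted_edges_from(
    [(0, 1, 100.0), (1, 2, 50.0), (2, 3, 400.0)], weight="length"
)
triples = list(pairwise_distances(G_demo, [0, 2, 3], cutoff=200.0))
```

Each source node is handled independently, so the outer loop could be split across workers. Since DBSCAN only needs neighbours within eps, a cutoff at the largest eps you plan to try keeps the output sparse; scikit-learn's DBSCAN accepts a sparse matrix with metric="precomputed".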

Keeping scale free graph's degree distribution after perturbation

Given a scale-free graph G (a graph whose degree distribution is a power law), and the following procedure:
for i in range(C):
    coin = randint(0, 1)
    if coin == 0:
        delete_random_edges(G)
    else:
        add_random_edge(G)
(C is a constant)
So, when C is large, the degree distribution after the procedure will look more like that of G(n,p). I am interested in preserving the power-law distribution, i.e. I want the graph to remain scale-free after this procedure, even for large C.
My idea is to write the procedures "delete_random_edges" and "add_random_edge" so that edges connected to a high-degree node have a small probability of being deleted (and, when adding a new edge, it is more likely to be attached to a high-degree node).
I use Networkx to represent the graph, and all I found are procedures that delete or add a specific edge. Any idea how I can implement the above?

Here are 2 algorithms:
Algorithm 1
This algorithm does not preserve the degree exactly, rather it preserves the expected degree.
Save each node's initial degree. Then delete edges at random. Whenever you create an edge, do so by randomly choosing two nodes, each with probability proportional to the initial degree of those nodes.
After a long period of time, the expected degree of each node 'u' is its initial degree (but it might be a bit higher or lower).
Basically, this will create what is called a Chung-Lu random graph. Networkx has a built in algorithm for creating them.
Note - this will allow the degree distribution to vary.
Algorithm 1a
Here is the efficient networkx implementation, which skips the edge deleting and adding and goes straight to the final result (assuming a networkx graph G):
degree_list = [d for _, d in G.degree()]  # was G.degree().values() in networkx 1.x
H = nx.expected_degree_graph(degree_list)
Here's the documentation
Algorithm 2
This algorithm preserves the degrees exactly.
Choose a set of edges and break them. Create a list in which each node appears as many times as the number of broken edges it was in. Shuffle this list. Create new edges between nodes that appear next to each other in this list.
Check to make sure you never join a node to itself or to a node which is already a neighbor. If this would occur you'll want to think of a custom way to avoid it. One option is to simply reshuffle the list. Another is to set those nodes aside and include them in the list you create the next time you do this.
Edit
There is a built-in networkx command, double_edge_swap, to swap two edges at a time. See its documentation.
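A short comparison of the two algorithms, as a sketch assuming a current NetworkX release; the Barabasi-Albert test graph, the number of swaps and the seeds are arbitrary illustration choices:

```python
import networkx as nx

# A scale-free test graph (200 nodes, 2 edges per new node; example values).
G = nx.barabasi_albert_graph(200, 2, seed=0)
degrees_before = dict(G.degree())

# Algorithm 1: a new graph whose degrees match only in expectation.
H = nx.expected_degree_graph(
    [d for _, d in G.degree()], selfloops=False, seed=0
)

# Algorithm 2: rewire G in place; every node keeps its exact degree.
nx.double_edge_swap(G, nswap=100, max_tries=10000, seed=0)
degrees_after = dict(G.degree())
```

After the swaps `degrees_after` equals `degrees_before`, whereas the degrees in `H` generally differ from node to node while keeping roughly the same overall distribution.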

Although you have already accepted the answer from @abdallah-sobehy, meaning that it works, I would suggest a simpler approach, in case it helps you or anybody else.
What you are trying to do is sometimes called preferential attachment (well, at least when you add nodes), and for that there is a random model developed quite some time ago, the Barabasi-Albert model, which leads to a power-law degree distribution P(k) ~ k^(-3).
Basically you have to add edges with probability equal to the degree of the node divided by the sum of the degrees of all the nodes. You can use scipy.stats to define the probability distribution with code like this (it assumes integer node labels, as rv_discrete requires):
import scipy.stats as stats
nodes = list(Gx.nodes())
sum_degrees = sum(d for _, d in Gx.degree())
p = [Gx.degree(n) / sum_degrees for n in nodes]
custm = stats.rv_discrete(name='custm', values=(nodes, p))
Then you just pick 2 nodes following that distribution, and those are the 2 nodes you add an edge between:
custm.rvs(size=2)
As for deleting, I haven't tried that myself. But I guess you could use something like this (it assumes no node has degree 0):
sum_inv_degrees = sum(1 / d for _, d in Gx.degree())
p = [1 / (Gx.degree(n) * sum_inv_degrees) for n in nodes]
although honestly I am not completely sure; it is no longer the random model that I link to above...
Hope it helps anyway.
UPDATE after comments
Indeed by using this method for adding nodes to an existing graph, you could get 2 undesired outcomes:
duplicated links
self links
You could remove those, although it will make the results deviate from the expected distribution.
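One way to avoid both outcomes is to redraw the pair until it is valid. The sketch below uses only the standard library and a plain dict of neighbour sets instead of a networkx graph; the function name `add_preferential_edge` and the retry limit are assumptions. As noted above, rejecting pairs shifts the result slightly away from pure degree-proportional sampling.

```python
import random

def add_preferential_edge(adjacency, rng, max_tries=100):
    """Add one edge whose two ends are drawn with probability ~ degree.

    adjacency maps each node to the set of its neighbours. Pairs that would
    create a self link or a duplicated link are rejected and redrawn.
    Returns the new edge (u, v), or None if no valid pair was found.
    """
    nodes = list(adjacency)
    weights = [len(adjacency[node]) for node in nodes]
    if sum(weights) == 0:
        return None                    # no edges yet, degrees are all zero
    for _ in range(max_tries):
        u, v = rng.choices(nodes, weights=weights, k=2)
        if u != v and v not in adjacency[u]:
            adjacency[u].add(v)
            adjacency[v].add(u)
            return u, v
    return None

# Example: a star with centre 0 and leaves 1..5. The centre is already linked
# to every leaf, so the only valid new edges join two different leaves.
star = {0: {1, 2, 3, 4, 5}}
star.update({leaf: {0} for leaf in range(1, 6)})
new_edge = add_preferential_edge(star, random.Random(0))
```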
Anyhow, you should take into account that you are already deviating from the preferential attachment model, since the algorithm studied by Barabasi and Albert works by adding new nodes and links to the existing graph:
The network begins with an initial connected network of m_0 nodes.
New nodes are added to the network one at a time. Each new node is connected to m ≤ m_0 existing nodes with a probability that is proportional to the number
...
(see here)
If you want to get an exact distribution (instead of growing an existing network and keeping its properties), you're probably better off with the answer from @joel
Hope it helps.

I am not sure to what extent this will preserve the scale-free property, but this can be a way to implement your idea:
To add an edge in networkx you need to specify 2 nodes. So you can choose one node with a probability proportional to its degree and the other node uniformly (without any preference). Choosing a highly connected node can be achieved as follows:
For a graph G where nodes are [0,1,2,...,n]
1) Create a list of floats (limits) between 0 and 1 that gives each node a probability of being chosen according to its degree. For example: limits[1] - limits[0] is the probability of choosing node 0, limits[2] - limits[1] is the probability of choosing node 1, etc.
# limits is a list that stores floats between 0 and 1 which define
# the probability of choosing a certain node depending on its degree
limits = [0.0]
# store total number of edges of the graph; the summation of all degrees is 2*num_edges
num_edges = G.number_of_edges()
# iterate nodes to calculate limits depending on degree
for i in G:
    limits.append(G.degree(i) / (2 * num_edges) + limits[-1])
2) Randomly generate a number between 0 and 1, then compare it to the limits and choose the node to add an edge to accordingly:
import numpy as np
rnd = np.random.random()
# compare the random number to the limits and choose node accordingly
for j in range(len(limits) - 1):
    if limits[j] <= rnd < limits[j + 1]:
        chosen_node = j
        break
3) Choose another node uniformly, by generating a random integer in [0,n]
4) Add an edge between the two chosen nodes.
5) Similarly, for deleting an edge, you can choose a node according to (1/degree) instead of degree and then uniformly delete any of its edges.
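Steps 1 and 2 can also be written with a cumulative sum and a binary search, which avoids the linear scan over the limits and does not require the nodes to be labelled 0..n. This is a standard-library sketch; the function name `pick_by_degree` and the input format (a dict of degrees, e.g. `dict(G.degree())` for a networkx graph) are assumptions, and it expects at least one node with non-zero degree.

```python
import bisect
import itertools
import random

def pick_by_degree(degrees, rng):
    """Pick a node with probability proportional to its degree.

    degrees maps node -> degree.
    """
    nodes = list(degrees)
    # limits[i] is the sum of the degrees of nodes[0..i] (unnormalised limits)
    limits = list(itertools.accumulate(degrees[node] for node in nodes))
    r = rng.random() * limits[-1]
    # first position whose limit is strictly greater than r
    return nodes[bisect.bisect_right(limits, r)]

rng = random.Random(0)
degrees = {"a": 0, "b": 1, "c": 3}
picks = [pick_by_degree(degrees, rng) for _ in range(200)]
```

A node with degree 0 is never selected, and in the long run "c" is picked about three times as often as "b".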
It would be interesting to know whether this approach preserves the scale-free property, and at which 'C' the property is lost, so let us know if it worked or not.
EDIT: As suggested by @joel, the selection of the node to add an edge to should be proportional to degree rather than degree^2. I have edited step 1 accordingly.
EDIT2: This might help you judge whether the scale-free graph lost its property after edge addition and removal. Simply compute the preferential attachment score before and after the changes. You can find the documentation here.

Generating power-law degree-distributed random directed graphs

I am looking for a way to generate random directed graphs $G(V,E)$ with a specific node and edge count and specified in- and out-degree distributions, without loops and fully connected. I found a function in R in this link.
I searched in networkx, but found only this function, where the graph grows by preferential attachment and hence the number of edges is not controllable.
Is there an equivalent to the R function in Python?

It might not be so easy to generate a graph like that (fixed number of edges, nodes, degree distribution, connected)....
But the directed configuration model might get you mostly there.
http://networkx.github.io/documentation/latest/reference/generated/networkx.generators.degree_seq.directed_configuration_model.html#networkx.generators.degree_seq.directed_configuration_model
Return a directed_random graph with the given degree sequences.
The configuration model generates a random directed pseudograph (graph
with parallel edges and self loops) by randomly assigning edges to
match the given degree sequences.
The example shows how to remove self-loops and parallel edges.
>>> D = nx.DiGraph([(0, 1), (1, 2), (2, 3)])  # directed path graph
>>> din = [d for _, d in D.in_degree()]
>>> dout = [d for _, d in D.out_degree()]
>>> din.append(1)
>>> dout[0] = 2
>>> D = nx.directed_configuration_model(din, dout)
To remove parallel edges:
>>> D = nx.DiGraph(D)
To remove self loops:
>>> D.remove_edges_from(list(nx.selfloop_edges(D)))
You will need to generate both an in-degree and out-degree sequence of your specified length and sum as inputs. If you remove self-loop edges and parallel edges that will likely reduce the number of edges from your original specification.
Also no guarantee your graph will be connected.
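As a rough sketch of how heavy-tailed input sequences could be produced for this function: the Pareto shape of 2.5, the node count and the trick of reusing the out-degrees in shuffled order as in-degrees are all assumptions made for illustration, not something prescribed by NetworkX or by the R function mentioned in the question.

```python
import random
import networkx as nx

rng = random.Random(0)
n = 200  # number of nodes (example value)

# Heavy-tailed out-degrees: Pareto samples (shape 2.5 is an arbitrary choice),
# rounded down to integers and capped at n - 1.
dout = [min(int(rng.paretovariate(2.5)), n - 1) for _ in range(n)]

# Reusing the same values in shuffled order gives an in-degree sequence with
# the same distribution and, crucially, the same sum.
din = dout[:]
rng.shuffle(din)

D = nx.directed_configuration_model(din, dout, seed=0)
requested_edges = D.number_of_edges()             # equals sum(dout)

D = nx.DiGraph(D)                                 # drop parallel edges
D.remove_edges_from(list(nx.selfloop_edges(D)))   # drop self-loops
```

Comparing `D.number_of_edges()` with `requested_edges` shows how many edges the clean-up removed; if distinct in- and out-degree distributions are needed, the two sequences have to be sampled separately and adjusted until their sums agree.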

Efficiently generating random graphs with a user-specified global clustering coefficient

I'm working on simulations of large-scale neuronal networks, for which I need to generate random graphs that represent the network topology.
I'd like to be able to specify the following properties of these graphs:
Number of nodes, N (~=1000-10000)
Average probability of a connection between any two given nodes, p (~0.01-0.2)
Global clustering coefficient, C (~0.1-0.5)
Ideally, the random graphs should be drawn uniformly from the set of all possible graphs that satisfy these user-specified criteria.
At the moment I'm using a very crude random diffusion approach where I start out with an Erdos-Renyi random network with the desired size and global connection probability, then on each step I randomly rewire some fraction of the edges. If the rewiring got me closer to the desired C then I keep the rewired network into the next iteration.
Here's my current Python implementation:
import igraph
import numpy as np

def generate_fixed_gcc(n, p, target_gcc, tol=1E-3):
    """
    Creates an Erdos-Renyi random graph of size n with a specified global
    connection probability p, which is then iteratively rewired in order to
    achieve a user-specified global clustering coefficient.
    """
    # initialize random graph
    G_best = igraph.Graph.Erdos_Renyi(n=n, p=p, directed=True, loops=False)
    loss_best = 1.
    n_edges = G_best.ecount()
    # start with a high rewiring rate
    rewiring_rate = n_edges
    n_iter = 0
    while loss_best > tol:
        # operate on a copy of the current best graph
        G = G_best.copy()
        # adjust the number of connections to rewire according to the current
        # best loss
        n_rewire = min(max(int(rewiring_rate * loss_best), 1), n_edges)
        G.rewire(n=n_rewire)
        # compute the global clustering coefficient
        gcc = G.transitivity_undirected()
        loss = abs(gcc - target_gcc)
        # did we improve?
        if loss < loss_best:
            # keep the new graph
            G_best = G
            loss_best = loss
            gcc_best = gcc
            # increase the rewiring rate
            rewiring_rate *= 1.1
        else:
            # reduce the rewiring rate
            rewiring_rate *= 0.9
        n_iter += 1
    # get adjacency matrix as a boolean numpy array
    # (np.bool was removed from recent NumPy releases; plain bool works)
    M = np.array(G_best.get_adjacency().data, dtype=bool)
    return M, n_iter, gcc_best
This works OK for small networks (N < 500), but it quickly becomes intractable as the number of nodes increases. It takes on the order of 20 sec to generate a 200 node graph, and several days to generate a 1000 node graph.
Can anyone suggest an efficient way to do this?

You are right. That is a very expensive method to achieve what you want. I can only speculate if there is a mathematically sound way to optimize and ensure that it is close to being a uniform distribution. I'm not even sure that your method leads to a uniform distribution although it seems like it would. Let me try:
Based on the docs for transitivity_undirected and wikipedia Clustering Coefficient, it sounds like it is possible to make changes locally in the graph and at the same time know the exact effect on global connectivity and global clustering.
The global clustering coefficient is based on triplets of nodes. A triplet consists of three nodes that are connected by either two (open triplet) or three (closed triplet) undirected ties. A triangle consists of three closed triplets, one centred on each of the nodes. The global clustering coefficient is the number of closed triplets (or 3 x triangles) over the total number of triplets (both open and closed).
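To make the definition concrete, here is a small plain-Python sketch that counts triplets directly. The function name `global_clustering` and the dict-of-neighbour-sets input are choices made for this example.

```python
from itertools import combinations

def global_clustering(adjacency):
    """Closed triplets divided by all triplets (open + closed).

    adjacency maps each node to the set of its neighbours (undirected graph).
    A triplet is a centre node together with an unordered pair of its
    neighbours; it is closed when those two neighbours are also linked.
    """
    closed = 0
    total = 0
    for centre, neighbours in adjacency.items():
        for a, b in combinations(neighbours, 2):
            total += 1
            if b in adjacency[a]:
                closed += 1
    return closed / total if total else 0.0

# A triangle 0-1-2 with an extra node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc = global_clustering(graph)
```

In this graph there are 5 triplets, 3 of them closed (the single triangle counted once per corner), so `cc` is 0.6.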
( * edit * ) Based on my reading of the paper referenced by ali_m, the method below will probably spend too many edges on low-degree clusters, leading to a graph that cannot achieve the desired clustering coefficient unless it is very low (which probably wouldn't be useful anyway). Therefore, on the off chance that somebody actually uses this, you will want to identify higher degree clusters to add edges to in order to quickly raise the clustering coefficient without needing to add a lot of edges.
On the other hand, the method below does align with the methods in the research paper so it's more or less a reasonable approach.
If I understand it correctly, you could do the following:
- Produce the graph as you have done.
- Calculate and track:
  - p_surplus, to track the number of edges that need to be added or removed elsewhere to maintain connectivity
  - cc_top, cc_btm, to track the clustering coefficient
- Iteratively (not completely) choose random pairs and connect or disconnect them to monotonically approach the clustering coefficient (cc) you want while maintaining the connectivity (p) you already have.
Pseudo code:
for random_pair in random_pairs:
    if (random_pair is connected) and (need to reduce cc or p): # maybe put priority on the one that has a larger gap?
        delete the edge
        p_surplus -= 1
        cc_top -= broken_connected_triplets # have to search locally
        cc_btm -= (broken_connected_triplets + broken_open_triplets) # have to search locally
    elif (random_pair is not connected) and (need to increase cc or p):
        add the edge
        p_surplus += 1
        cc_top += new_connected_triplets
        cc_btm += (new_connected_triplets + new_open_triplets)
    if cc and p are within desired ranges:
        done
    if some condition for detecting infinite loops:
        rethink this method
That may not be totally correct, but I think the approach will work. Searching for local triplets and always moving your parameters in the right direction will be more efficient than copying the graph and globally measuring the cc so many times.
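The local bookkeeping for the "add the edge" branch can be sketched as follows for an undirected graph stored as a dict of neighbour sets. The function name `triplet_change_on_add` is made up here, and this is my own derivation of the update rather than something taken from the igraph docs: each common neighbour of u and v yields one new triangle (three closed triplets), and the new edge creates one new triplet for every edge already incident to u or v.

```python
def triplet_change_on_add(adjacency, u, v):
    """Return (delta_closed, delta_total) caused by adding the edge u-v.

    Call this before the edge is inserted. adjacency maps each node to the
    set of its neighbours.
    """
    common = adjacency[u] & adjacency[v]
    delta_closed = 3 * len(common)                        # 3 per new triangle
    delta_total = len(adjacency[u]) + len(adjacency[v])   # triplets using u-v
    return delta_closed, delta_total

# A triangle 0-1-2 with node 3 attached to node 2:
# 3 closed triplets out of 5 in total.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc_top, cc_btm = 3, 5
d_closed, d_total = triplet_change_on_add(graph, 0, 3)
graph[0].add(3)
graph[3].add(0)
cc_top += d_closed
cc_btm += d_total
```

After adding the edge 0-3 the counters read 6 and 8, i.e. the clustering coefficient rises from 0.6 to 0.75. For a deletion, the same quantities computed after removing the edge give the amounts to subtract.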

Having done a bit of reading, it looks as though the best solution might be the generalized version of Gleeson's algorithm presented in this paper. However, I still don't really understand how to implement it, so for the time being I've been working on Bansal et al's algorithm.
Like my naive approach, this is a Markov chain-based method that uses random edge swaps, but unlike mine it specifically targets 'triplet motifs' within the graph for rewiring:
Since this will have a greater tendency to introduce triangles, it will therefore have a greater impact on the clustering coefficient. At least in the case of undirected graphs, the rewiring step is also guaranteed to preserve the degree sequence. Again, on every rewiring iteration the new global clustering coefficient is measured, and the new graph is accepted if the GCC got closer to the target value.
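For illustration, a single rewiring attempt for the undirected case might look like the sketch below. It reflects my reading of the triplet motif (a node x with two unlinked neighbours y1 and y2, each of which gives up one other edge so that y1 and y2 can be joined); the function name, the rejection rules and the dict-of-sets graph format are assumptions and not the code from the paper or the linked implementation.

```python
import random

def triangle_closing_swap(adjacency, rng):
    """Try one degree-preserving swap that closes the triangle x-y1-y2.

    Edges (y1, z1) and (y2, z2) are replaced by (y1, y2) and (z1, z2).
    Returns True if the graph was changed, False if the attempt was rejected.
    """
    x = rng.choice(sorted(adjacency))
    if len(adjacency[x]) < 2:
        return False
    y1, y2 = rng.sample(sorted(adjacency[x]), 2)
    if y2 in adjacency[y1]:
        return False                   # x, y1, y2 already form a triangle
    z1_options = sorted(adjacency[y1] - {x, y2})
    z2_options = sorted(adjacency[y2] - {x, y1})
    if not z1_options or not z2_options:
        return False
    z1 = rng.choice(z1_options)
    z2 = rng.choice(z2_options)
    if z1 == z2 or z2 in adjacency[z1]:
        return False                   # would create a self-loop or duplicate
    for a, b in ((y1, z1), (y2, z2)):  # remove the two old edges
        adjacency[a].discard(b)
        adjacency[b].discard(a)
    for a, b in ((y1, y2), (z1, z2)):  # insert the two new edges
        adjacency[a].add(b)
        adjacency[b].add(a)
    return True

# Example: repeated attempts on a ring of 12 nodes (every degree is 2).
ring = {i: {(i - 1) % 12, (i + 1) % 12} for i in range(12)}
rng = random.Random(0)
results = [triangle_closing_swap(ring, rng) for _ in range(50)]
```

Every node keeps its degree because each of y1, y2, z1 and z2 loses exactly one edge and gains exactly one. In the full method each accepted swap would be followed by the clustering check described above.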
Bansal et al actually provided a Python implementation, but for various reasons I ended up writing my own version, which you can find here.
Performance
The Bansal approach takes just over half the number of iterations and half the total time compared with my naive diffusion method:
I was hoping for bigger gains, but a 2x speedup is better than nothing.
Generalizing to directed graphs
One remaining challenge with the Bansal method is that my graphs are directed, whereas Bansal et al's algorithm is only designed to work on undirected graphs. With a directed graph, the rewiring step is no longer guaranteed to preserve the in- and out-degree sequences.
Update
I've just figured out how to generalize the Bansal method to preserve both the in- and out-degree sequences for directed graphs. The trick is to select motifs where the two outward edges to be swapped have opposite directions (the directions of the edges between {x, y1} and {x, y2} don't matter):
I've also made some more optimizations, and the performance is starting to look a bit more respectable - it takes roughly half the number of iterations and half the total time compared with the diffusion approach. I've updated the graphs above with the new timings.

I came up with a graph generation model that can easily generate connected random graphs of some 10,000 nodes and more that follow prescribed degree and (local) clustering coefficient distributions which can be chosen such that any desired global clustering coefficient results. You can find a short description here. By the way, you will find your question (this one) in the references.

Kolda et al. proposed the BTER model (2013), which can generate random graphs with a prescribed degree and clustering coefficient distribution (and thus a prescribed global clustering index). It seems a bit more complicated than my model (see above), but maybe it's faster or generates less biased graphs. (But to be honest, I assume that my model doesn't generate severely biased graphs either, but essentially random graphs.)
