Merge several graphml-files with networkx and remove duplicates - python

I'm new to programming, Python, and networkx (ouch!) and am trying to merge four graphml files into one, removing the duplicate nodes, following the excellent instructions here
However, I can't figure out how to keep track of the duplicate nodes when there are FOUR files to compare instead of two. The code I've written below won't work, but hopefully you can see how I'm thinking wrong and help me.
# script to merge several graphml files into one single graphml file
import networkx as nx

# First read the graphml files into Python and networkx (duplicate variables as necessary)
A = nx.read_graphml("file1.graphml")
B = nx.read_graphml("file2.graphml")
C = nx.read_graphml("file3.graphml")
D = nx.read_graphml("file4.graphml")

# Create a new graph variable containing all the previous graphs
H = nx.union(A, B, C, D, rename=('1-', '2-', '3-', '4-'))

# Check which nodes are in two or more of the original graphml files
duplicate_nodes_a_b = [n for n in A if n in B]
duplicate_nodes_b_c = [n for n in B if n in C]
duplicate_nodes_c_d = [n for n in C if n in D]
all_duplicate_nodes = # How should I get this?

# remove duplicate nodes
for n in all_duplicate_nodes:
    n1 = '1-' + str(n)
    n2 = '2-' + str(n)
    n3 = '3-' + str(n)
    n4 = '4-' + str(n)
    H.add_edges_from([(n1, nbr) for nbr in H[n2]])  # How can I take care of duplicate_nodes_b_c and duplicate_nodes_c_d?
    H.remove_node(n2)

# write the merged graph into a new merged graphml file
nx.write_graphml(H, "merged_file.graphml", encoding="utf-8", prettyprint=True)

First, note that the way you use nx.union is not what you want: it must be called with exactly two graphs. Dealing with the duplicates also gets complicated this way, because you have to consider every possible pair of graphs to see where a node could be duplicated.
Instead, let's be more direct and just count up in how many graphs each node appears. This is easy using a Counter:
import collections

ctr = collections.Counter()
for G in [A, B, C, D]:
    ctr.update(G)
Now determine which nodes just appear once, using the counter:
singles = {x for (x, n) in ctr.items() if n == 1}
With that set of nodes, we can then compute the subgraphs containing only nodes that are not duplicated:
E = nx.union(A.subgraph(singles), B.subgraph(singles))
F = nx.union(C.subgraph(singles), D.subgraph(singles))
H = nx.union(E, F)
The graph H has all four initial graphs merged with duplicates removed.
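As a quick sanity check (my own addition, reusing the ctr from above), every node that made it into H should have been counted exactly once across the four inputs:

assert all(ctr[n] == 1 for n in H)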
The approach I've shown creates several intermediate graphs, so for large inputs you may run into memory problems. If so, a similar approach works without keeping all the intermediates: determine the set of duplicated nodes, delete those nodes from the original graphs, and then take the union. It looks like:
import collections
import networkx as nx

ctr = collections.Counter()
for G in [A, B, C, D]:
    ctr.update(G)
duplicates = {x for (x, n) in ctr.items() if n > 1}
H = nx.Graph()
for G in [A, B, C, D]:
    G.remove_nodes_from(duplicates)  # changes the graphs in place
    H = nx.union(H, G)
Both approaches take advantage of the way that NetworkX functions often allow extra nodes to be given and silently ignored.
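For example (a small illustration of my own), both subgraph and remove_nodes_from accept nodes that are absent from the graph and simply skip them:

import networkx as nx

G = nx.Graph([(1, 2), (2, 3)])
print(list(G.subgraph({1, 2, 99}).nodes()))  # [1, 2] -- node 99 is silently ignored
G.remove_nodes_from({3, 99})                 # removing a missing node is a no-op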

If the graphml files are simple (no weights, properties, etc.), then it may be easier to work at the text level. For instance,
cat A.graphml B.graphml C.graphml | sort -r | uniq > D.graphml
This keeps the unique sets of nodes and edges from the three graphml files. You can then rearrange the <graph>, </graph>, <graphml ...>, </graphml> tags in D.graphml with a text editor.

How to generate all directed permutations of an undirected graph?

I am looking for a way to generate all possible directed graphs from an undirected template: for each edge in the template, choose the LEFT, RIGHT, or BOTH direction for the resulting edge. For my example template there are six such directed versions.
There is a huge number of outputs for even a small graph, because there are 3^E valid permutations (where E is the number of edges in the template graph), but many of them are duplicates (specifically, they are automorphic to another output), and I only need one from each such group.
I'm curious first: is there a term for this operation? Surely this is a formal and well-understood process already?
And second, is there a more efficient algorithm to produce this list? My current code (Python, NetworkX, though that's not important for the question) looks like this; it has two things I don't like:
I generate all permutations even if they are isomorphic to a previous graph
I check isomorphism at the end, so it adds additional computational cost
import itertools
import networkx as nx

def all_orientations(T):
    # T is the undirected template graph
    results = []
    edges = list(T.edges())
    # enumerate all 3^E assignments: for each edge,
    # 0 = keep both directions, 1 = forward, 2 = reverse
    for digits in itertools.product((0, 1, 2), repeat=len(edges)):
        G = nx.DiGraph()
        G.add_nodes_from(T.nodes())
        for (a, b), digit in zip(edges, digits):
            if digit == 1:
                G.add_edge(a, b)
            elif digit == 2:
                G.add_edge(b, a)
            else:
                G.add_edge(a, b)
                G.add_edge(b, a)
        # keep G only if no earlier result is isomorphic to it
        if not any(nx.is_isomorphic(G, R) for R in results):
            results.append(G)
    return results
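For instance, running it on a two-edge path (the all_orientations name is mine, and the expected count of 6 is my own calculation):

T = nx.path_graph(3)               # undirected template with two edges
print(len(all_orientations(T)))    # 6 orientations, up to isomorphism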

Retrieving original node names in Networkit

I am not sure I understand how Networkit handles the names of the nodes.
Let's say that I read a large graph from an edgelist using another Python module like networkx; then I convert it to a NetworKit graph and perform some operations, like computing the pairwise distances. A simple piece of code to do this could be:
import networkx as nx
import networkit as nk
nxG = nx.read_edgelist('test.edgelist', data=True)
G = nk.nxadapter.nx2nk(nxG, weightAttr='weight')
apsp = nk.distance.APSP(G)
apsp.run()
dist = apsp.getDistances()
easy-peasy.
Now, what if I want to do something with those distances? For example, what if I want to plot them against, I don’t know, the weights on the paths, or any other measure that requires retrieving the original node ids?
The getDistances() function returns a list of lists, one per node, with the distance to every other node. But I have no clue how NetworKit maps the node names to the sequence of ints it uses as node identifiers, and thus what order it followed when computing the distances and storing them in the output.
When creating a new graph from networkx, NetworKit internally builds a dictionary that maps each node id in nxG to a unique integer from 0 to n - 1 in G (where n is the number of nodes).
Unfortunately, this mapping is not returned by nx2nk, so you have to recreate it yourself.
Let's assume that you want to get a distance from node 1 to node 2, where 1 and 2 are node ids in nxG:
import networkx as nx
import networkit as nk

nxG = nx.read_edgelist('test.edgelist', data=True)
G = nk.nxadapter.nx2nk(nxG, weightAttr='weight')

# Rebuild the mapping from node ids in nxG to node ids in G
idmap = dict(zip(nxG.nodes(), range(nxG.number_of_nodes())))

apsp = nk.distance.APSP(G)
apsp.run()
dist = apsp.getDistances()

# Get the distance from node `1` to node `2`
dist_from_1_to_2 = dist[idmap['1']][idmap['2']]
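If you also need to go the other way, e.g. to label the rows of dist with the original names, you can invert the dictionary (a small addition of my own):

namemap = {u: name for (name, u) in idmap.items()}
# namemap[0] is the original nxG id of NetworKit node 0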

python networkx get unique matching combinations

I have a graph of nodes that are potential duplicates of items and I'm trying to find all possible combinations of matches. If two nodes are connected, that means they are potentially the same item, but no node can be matched more than once.
For example, if I take the following simple graph:
T = nx.Graph()
T.add_edge('A','B')
T.add_edge('A','C')
T.add_edge('B','D')
T.add_edge('D','A')
In this example, the possible sets of matches would be:
[{A:B}, {A:C, B:D}, {A:D}]
How can I develop a list of unique combinations? Some of the graphs have ~20 nodes, so brute forcing through all combinations is out.
It seems that what you are looking for is to find matchings of G, i.e., sets of edges where no two edges share a common vertex.
In particular, you are looking for maximal matchings of G.
Networkx offers the function maximal_matching. You may extend this function to obtain all the maximal matchings.
One way to do it may be the following. You start with a list of partial matchings, each made by an edge. Each partial matching is then extended until it becomes a maximal one, i.e., until it cannot be extended to a matching of larger cardinality.
If a partial matching m can be extended to a larger one using an edge (u,v), then m'=m ∪ {(u,v)} is added to the list of partial matchings. Otherwise, m is added to the list of maximal matchings.
The following code can be made more efficient in many ways. One is to check for duplicates before adding to the list of partial matchings; as written, the list will contain several copies of the same partial matching (e.g., [{i,j},{u,v}] and [{u,v},{i,j}]).
import networkx as nx
import itertools

def all_maximal_matchings(T):
    maximal_matchings = []
    partial_matchings = [{(u, v)} for (u, v) in T.edges()]
    while partial_matchings:
        # get the next partial matching
        m = partial_matchings.pop()
        nodes_m = set(itertools.chain(*m))
        extended = False
        for (u, v) in T.edges():
            if u not in nodes_m and v not in nodes_m:
                extended = True
                # copy m, extend it, and add it to the list of partial matchings
                m_extended = set(m)
                m_extended.add((u, v))
                partial_matchings.append(m_extended)
        if not extended and m not in maximal_matchings:
            maximal_matchings.append(m)
    return maximal_matchings
T = nx.Graph()
T.add_edge('A','B')
T.add_edge('A','C')
T.add_edge('B','D')
T.add_edge('D','A')
print(all_maximal_matchings(T))
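On this example graph, the function yields the three matchings from the question: A-B alone, A-C together with B-D, and A-D alone (printed as sets of edge tuples, in some order).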

Shuffling a large network using Python

I have a large network to analyze. For example:
import networkx as nx
import random
BA = nx.random_graphs.barabasi_albert_graph(1000000, 3)
nx.info(BA)
I have to shuffle the edges while keeping the degree distribution unchanged. The basic idea was introduced by Maslov. So my colleague and I wrote a shuffleNetwork function that performs Num swap attempts on a graph object G; edges is a list object.
The problem is that this function runs too slowly for large networks. I tried using a set or dict instead of a list for the edges object (sets and dicts are hash tables). However, since we also need to delete and add elements to it, the overall time complexity got even worse.
Do you have any suggestions on further optimising this function?
import random
import networkx as nx

def shuffleNetwork(G, Num):
    edges = list(G.edges())
    l = range(len(edges))
    for n in range(Num):
        i, j = random.sample(l, 2)
        a, b = edges[i]
        c, d = edges[j]
        if a != d and c != b:
            # parentheses matter here: without them, `not` binds
            # only to the first membership test
            if not ((a, d) in edges or (d, a) in edges or
                    (c, b) in edges or (b, c) in edges):
                edges[i] = (a, d)
                edges[j] = (c, b)
    K = nx.from_edgelist(edges)
    return K
import timeit

start = timeit.default_timer()
gr = shuffleNetwork(BA, 1000)
stop = timeit.default_timer()
print(stop - start)
You should consider using nx.double_edge_swap.
It looks like it does exactly what you want (see the NetworkX documentation), but it modifies the graph in place.
I'm not sure whether it will solve the speed issues, but it does avoid generating the list, so I think it will do better than what you've got.
You would call it with nx.double_edge_swap(G,nswap=number)
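For example, a minimal sketch (the nswap and max_tries values here are placeholders to tune for your graph):

import networkx as nx

BA = nx.random_graphs.barabasi_albert_graph(1000000, 3)
# perform 1000 degree-preserving swaps, modifying BA in place
nx.double_edge_swap(BA, nswap=1000, max_tries=10000)

Note that it raises an error if it cannot finish nswap swaps within max_tries attempts.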

Graph updating algorithm

I have a (un-directed) graph represented using adjacency lists, e.g.
a: b, c, e
b: a, d
c: a, d
d: b, c
e: a
where each node of the graph is linked to a list of other node(s)
I want to update such a graph given some new list(s) for certain node(s), e.g.
a: b, c, d
where a is no longer connected to e, and is connected to a new node d
What would be an efficient (both time and space wise) algorithm for performing such updates to the graph?
Maybe I'm missing something, but wouldn't it be fastest to use a dictionary (or defaultdict) mapping node labels (strings or numbers) to sets? In that case an update could look something like this:
def update(graph, node, edges, undirected=True):
    # graph: dict(str -> set(str)), node: str, edges: set(str), undirected: bool
    if undirected:
        for e in graph[node]:
            graph[e].remove(node)   # detach node from its old neighbours
        for e in edges:
            graph[e].add(node)      # attach node to its new neighbours
    graph[node] = edges
Using sets and dicts, adding and removing the node to/from the edge sets of the other nodes is O(1), as is replacing the edge set of the node itself, so the whole update is O(n) for the two loops, with n being the number of edges of the node.
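A quick usage sketch with the graph from the question (using defaultdict, so nodes seen for the first time do not raise a KeyError):

from collections import defaultdict

graph = defaultdict(set)
for node, neighbors in [('a', 'bce'), ('b', 'ad'), ('c', 'ad'),
                        ('d', 'bc'), ('e', 'a')]:
    graph[node] = set(neighbors)

update(graph, 'a', {'b', 'c', 'd'})
print(sorted(graph['d']))  # ['a', 'b', 'c']
print(sorted(graph['e']))  # [] -- 'e' is no longer linked to 'a'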
Using an adjacency grid (matrix) would make updates O(n), but it takes n^2 space regardless of how sparse the graph is (each changed relationship is applied trivially by updating the corresponding row and column entries).
Using lists would put the time up to O(n^2) for updating, but for sparse graphs would not take a huge time penalty, and would save a lot of space.
A typical update would be del edge a,e; add edge a,d, but your update looks like a whole new adjacency list for vertex a. So simply find the adjacency list for a and replace it. That should take O(log n) time, assuming a sorted array of adjacency lists, as in your description.
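A minimal sketch of just that lookup-and-replace step (my own illustration; it assumes the adjacency lists are kept in an array sorted by vertex, and it does not touch the neighbours' own lists):

import bisect

# adjacency lists stored as (vertex, neighbors) pairs, sorted by vertex
adj = [('a', ['b', 'c', 'e']), ('b', ['a', 'd']), ('c', ['a', 'd']),
       ('d', ['b', 'c']), ('e', ['a'])]

def replace_list(adj, vertex, neighbors):
    i = bisect.bisect_left(adj, (vertex,))  # binary search: O(log n)
    if i < len(adj) and adj[i][0] == vertex:
        adj[i] = (vertex, neighbors)

replace_list(adj, 'a', ['b', 'c', 'd'])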
