I have a graph of nodes that are potential duplicates of items and I'm trying to find all possible combinations of matches. If two nodes are connected, that means they are potentially the same item, but no node can be matched more than once.
For example, if I take the following simple graph:
T = nx.Graph()
T.add_edge('A','B')
T.add_edge('A','C')
T.add_edge('B','D')
T.add_edge('D','A')
In this example, my output could be any of the following:
[{A:B},{A:C,B:D},{A:D}]
How can I develop a list of unique combinations? Some of the graphs have ~20 nodes, so brute forcing through all combinations is out.
It seems that what you are looking for is to find matchings of G, i.e., sets of edges where no two edges share a common vertex.
In particular, you are looking for maximal matchings of G.
Networkx offers the function maximal_matching. You may extend this function to obtain all the maximal matchings.
One way to do it may be the following. You start with a list of partial matchings, each made by an edge. Each partial matching is then extended until it becomes a maximal one, i.e., until it cannot be extended to a matching of larger cardinality.
If a partial matching m can be extended to a larger one using an edge (u,v), then m'=m ∪ {(u,v)} is added to the list of partial matchings. Otherwise, m is added to the list of maximal matchings.
The following code can be made more efficient in many ways. One way is to check for duplicates before adding to the list of partial matchings: as written, the list will contain the same partial matching several times, reached by adding its edges in different orders (e.g., [{i,j},{u,v}] and [{u,v},{i,j}]).
import networkx as nx
import itertools
def all_maximal_matchings(T):
    maximal_matchings = []
    partial_matchings = [{(u,v)} for (u,v) in T.edges()]
    while partial_matchings:
        # get current partial matching
        m = partial_matchings.pop()
        nodes_m = set(itertools.chain(*m))
        extended = False
        for (u,v) in T.edges():
            if u not in nodes_m and v not in nodes_m:
                extended = True
                # copy m, extend it and add it to the list of partial matchings
                m_extended = set(m)
                m_extended.add((u,v))
                partial_matchings.append(m_extended)
        if not extended and m not in maximal_matchings:
            maximal_matchings.append(m)
    return maximal_matchings
T = nx.Graph()
T.add_edge('A','B')
T.add_edge('A','C')
T.add_edge('B','D')
T.add_edge('D','A')
print(all_maximal_matchings(T))
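The duplicate check mentioned above could be sketched as follows. This is only one possible variant, not the original answer's code: the function name all_maximal_matchings_dedup and the use of frozenset with a "seen" set are assumptions made for illustration.

```python
import networkx as nx

def all_maximal_matchings_dedup(T):
    """Like the search above, but each partial matching is expanded only once."""
    edges = list(T.edges())          # each undirected edge appears once, in a fixed orientation
    stack = [frozenset([e]) for e in edges]
    seen = set(stack)                # partial matchings already queued
    maximal = []
    while stack:
        m = stack.pop()
        covered = {n for e in m for n in e}
        extended = False
        for (u, v) in edges:
            if u not in covered and v not in covered:
                extended = True
                bigger = m | {(u, v)}
                if bigger not in seen:   # skip matchings reached in another order
                    seen.add(bigger)
                    stack.append(bigger)
        if not extended:
            maximal.append(set(m))
    return maximal

T = nx.Graph()
T.add_edges_from([('A', 'B'), ('A', 'C'), ('B', 'D'), ('D', 'A')])
print(all_maximal_matchings_dedup(T))
```

Because a frozenset is hashable, membership in seen is a constant-time lookup, whereas the list scan in the original grows with the number of matchings found so far.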
Hello, I'm using the networkx library. I have created a graph, but I'm having trouble finding multiple targets. The target values are a bit tricky, because a target has to be matched as a substring of the node name.
Example:
Nodes = ['C0111', 'N6186', 'C5572', 'N6501', 'C0850-IASW-NO01', 'C1182-IUPE-NO01']
Edges = [('C0111','N6186'),('N6186','C0850-IASW-NO01'),('C0111','C5572'),('C5572','N6501'),('N6501','C1182-IUPE-NO01')]
Problem:
Source = 'C0111'
Target = ['IASW','IUPE']
There are eight kinds of special nodes that count as targets, including nodes containing 'IUPE', 'IASW', etc.
I can create the graph using networkx.
import networkx as nx
G = nx.Graph()
G.add_nodes_from(Nodes)
G.add_edges_from(Edges)
nx.shortest_path(G,source='C0111',target=?)
For multiple targets I can iterate over them, but I'm confused about how to match a substring within the node name.
Example:
Normal way ==> nx.shortest_path(G,source='C0111',target='C0850-IASW-NO01')
'C0850-IASW-NO01' => that's how the node is created
But I want to check whether the target has IASW or IUPE in it.
One solution is to use the pattern to subset the target nodes before looking for the shortest paths:
target_nodes = [n for n in G if "IASW" in str(n) or "IUPE" in str(n)]
With a list of target nodes, now it's possible to iterate over them and find the shortest path of interest (as you describe).
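Putting the two steps together, a minimal sketch using the nodes and edges from the question might look like this. The names patterns and paths are introduced here for illustration, and the has_path guard is an assumption about how unreachable targets should be handled.

```python
import networkx as nx

nodes = ['C0111', 'N6186', 'C5572', 'N6501', 'C0850-IASW-NO01', 'C1182-IUPE-NO01']
edges = [('C0111', 'N6186'), ('N6186', 'C0850-IASW-NO01'), ('C0111', 'C5572'),
         ('C5572', 'N6501'), ('N6501', 'C1182-IUPE-NO01')]

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)

source = 'C0111'
patterns = ['IASW', 'IUPE']

# Keep every node whose name contains at least one of the patterns.
target_nodes = [n for n in G if any(p in str(n) for p in patterns)]

# One shortest path per reachable target.
paths = {}
for target in target_nodes:
    if nx.has_path(G, source, target):
        paths[target] = nx.shortest_path(G, source=source, target=target)

for target, path in paths.items():
    print(target, path)
```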
I am looking for a way to generate all possible directed graphs from an undirected template. For example, given this graph "template":
I want to generate all six of these directed versions:
In other words, for each edge in the template, choose LEFT, RIGHT, or BOTH direction for the resulting edge.
There is a huge number of outputs for even a small graph, because there are 3^E valid permutations (where E is the number of edges in the template graph), but many of them are duplicates (specifically, they are automorphic to another output). Take these two, for example:
I only need one.
First, I'm curious: is there a term for this operation? Surely this is already a formal and well-understood process?
And second, is there a more efficient algorithm to produce this list? My current code (Python, NetworkX, though that's not important for the question) looks like this, which has two things I don't like:
I generate all permutations even if they are isomorphic to a previous graph
I check isomorphism at the end, so it adds additional computational cost
Results := Empty List
T := The Template (Undirected Graph)
For i in range(3^E):
    Create an empty directed graph G
    Convert i to trinary
    For each nth edge (A, B) in T:
        If the nth digit of i in trinary is 1:
            Add the edge to G as (A, B)
        If the nth digit of i in trinary is 2:
            Add the edge to G as (B, A)
        If the nth digit of i in trinary is 0:
            Add the reversed AND forward edges to G
    For every graph in Results:
        If G is isomorphic to that graph, STOP (discard G)
    Add G to Results
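For reference, the pseudocode above could be written in runnable form roughly as follows. This is a sketch of the brute-force approach only, not a faster algorithm; the function name directed_versions and the state labels are invented for illustration, and itertools.product replaces the explicit trinary conversion.

```python
import itertools
import networkx as nx

def directed_versions(template):
    """Enumerate orientations of `template`, keeping one graph per isomorphism class."""
    edges = list(template.edges())
    results = []
    # Each edge independently gets one of three states (3^E combinations in total).
    for choice in itertools.product(("forward", "backward", "both"), repeat=len(edges)):
        g = nx.DiGraph()
        g.add_nodes_from(template.nodes())
        for (a, b), state in zip(edges, choice):
            if state in ("forward", "both"):
                g.add_edge(a, b)
            if state in ("backward", "both"):
                g.add_edge(b, a)
        # any() stops at the first match, so g is dropped as soon as a duplicate is found.
        if not any(nx.is_isomorphic(g, kept) for kept in results):
            results.append(g)
    return results

template = nx.path_graph(3)   # a three-node path, used here only as an example template
print(len(directed_versions(template)))
```

Checking each candidate as it is generated, rather than at the end, keeps the Results list small, but every one of the 3^E candidates is still constructed.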
I am not sure I understand how Networkit handles the names of the nodes.
Let's say that I read a large graph from an edgelist using another Python module, like Networkx; then I convert it to a Networkit graph and perform some operations, like computing the pairwise distances. A simple piece of code to do this could be:
import networkx as nx
import networkit as nk
nxG=nx.read_edgelist('test.edgelist',data=True)
G = nk.nxadapter.nx2nk(nxG, weightAttr='weight')
apsp = nk.distance.APSP(G)
apsp.run()
dist=apsp.getDistances()
easy-peasy.
Now, what if I want to do something with those distances? For example, what if I want to plot them against, I don't know, the weights on the paths, or any other measure that requires retrieving the original node ids?
The getDistances() function returns a list of lists, one for each node with the distance to every other node, but I have no clue on how Networkit maps the nodes’ names to the sequence of ints that it uses as nodes identifiers, thus the order it followed to compute the distances and store them in the output.
When creating a new graph from networkx, NetworKit creates a dictionary that maps each node id in nxG to a unique integer from 0 to n - 1 in G (where n is the number of nodes).
Unfortunately, this mapping is not returned by nx2nk, so you should create it yourself.
Let's assume that you want to get a distance from node 1 to node 2, where 1 and 2 are node ids in nxG:
import networkx as nx
import networkit as nk
nxG=nx.read_edgelist('test.edgelist',data=True)
G = nk.nxadapter.nx2nk(nxG, weightAttr='weight')
# Get mapping from node ids in nxG to node ids in G
idmap = dict((id, u) for (id, u) in zip(nxG.nodes(), range(nxG.number_of_nodes())))
apsp = nk.distance.APSP(G)
apsp.run()
dist=apsp.getDistances()
# Get distance from node `1` to node `2`
dist_from_1_to_2 = dist[idmap['1']][idmap['2']]
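To go the other way and label the whole distance matrix with the original node ids, one option is to re-key it using the same node order. The sketch below is pure Python; the helper name label_distances is hypothetical, and the hard-coded dist is only a stand-in for the list of lists that apsp.getDistances() would return.

```python
import networkx as nx

def label_distances(nxG, dist):
    """Re-key a row/column-indexed distance matrix by the original node ids."""
    names = list(nxG.nodes())   # position i in this list corresponds to integer node i
    return {
        (names[i], names[j]): d
        for i, row in enumerate(dist)
        for j, d in enumerate(row)
    }

nxG = nx.Graph()
nxG.add_weighted_edges_from([("a", "b", 1.0), ("b", "c", 2.0)])

# Stand-in for apsp.getDistances(): weighted distances for the path a - b - c.
dist = [[0.0, 1.0, 3.0],
        [1.0, 0.0, 2.0],
        [3.0, 2.0, 0.0]]

labelled = label_distances(nxG, dist)
print(labelled[("a", "c")])
```

This relies on nxG.nodes() returning the nodes in the same order that was used when the graph was converted, so nxG should not be modified between the conversion and the lookup.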
I have a list with two elements like this:
list_a = [27.666521, 85.437447]
and another list like this:
big_list = [[27.666519, 85.437477], [27.666460, 85.437622], ...]
And I want to find the closest match of list_a within big_list.
For example, here the closest match would be [27.666519, 85.437477].
How would I be able to achieve this?
I found a similar problem here for finding the closest match of a string in an array but was unable to reproduce it similarly for the above mentioned problem.
P.S. The elements in the list are the coordinates of points on the earth.
From your question, it's hard to tell how you want to measure the distance, so I simply assume you mean Euclidean distance.
You can use the key parameter to min():
from functools import partial
def distance_squared(x, y):
    return (x[0] - y[0])**2 + (x[1] - y[1])**2
print(min(big_list, key=partial(distance_squared, list_a)))
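Since the points are coordinates on the earth, a great-circle distance may be a better key than the Euclidean one. The following is a sketch under two assumptions that the question does not state: each pair is (latitude, longitude) in degrees, and a spherical earth with a mean radius of 6371 km is accurate enough. The function name haversine_km is introduced here for illustration.

```python
import math

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

list_a = [27.666521, 85.437447]
big_list = [[27.666519, 85.437477], [27.666460, 85.437622]]

closest = min(big_list, key=lambda p: haversine_km(list_a, p))
print(closest)
```

For points this close together the two distance measures will usually agree on the nearest neighbour; the difference matters more over long distances or near the poles.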
Assumptions:
You intend to make this type of query more than once on the same list of lists
Both the query list and the lists in your list of lists represent points in a n-dimensional euclidean space (here: a 2-dimensional space, unlike GPS positions that come from a spherical space).
This reads like a nearest neighbor search. Probably you should take into consideration a library dedicated for this, like scikits.ann.
Example:
import scikits.ann as ann
import numpy as np
k = ann.kdtree(np.array(big_list))
indices, distances = k.knn(list_a, 1)
This uses Euclidean distance internally. You should make sure that the distance measure you apply matches your idea of proximity.
You might also want to have a look at Quadtree, which is another data structure that you could apply to avoid the brute-force minimum search through your entire list of lists.
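If scikits.ann is not available, the same kind of k-d tree query can be sketched with SciPy instead. Note that this swaps in a different library from the one used above (scipy.spatial.cKDTree), and it still measures plain Euclidean distance on the raw coordinates.

```python
import numpy as np
from scipy.spatial import cKDTree

list_a = [27.666521, 85.437447]
big_list = [[27.666519, 85.437477], [27.666460, 85.437622]]

# Build the tree once; it can then be reused for many queries.
tree = cKDTree(np.array(big_list))

# k=1 asks for the single nearest neighbour of the query point.
distance, index = tree.query(list_a, k=1)
print(big_list[index], distance)
```

Building the tree costs more than a single linear scan, so this only pays off when many queries are run against the same list.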
I am writing a piece of code which models the evolution of a social network. The idea is that each person is assigned to a node and relationships between people (edges on the network) are given a weight of +1 or -1 depending on whether the relationship is friendly or unfriendly.
Using this simple model you can say that a triad of three people is either "balanced" or "unbalanced" depending on whether the product of the edges of the triad is positive or negative.
What I am trying to do is implement an Ising-type model: random edges are flipped, and the new relationship is kept if the new network has more balanced triangles (a lower energy) than the network before the flip. If that is not the case, the new relationship is kept only with a certain probability.
Now, onto my question: I have written the following code, but the dataset I have contains ~120k triads, so it will take 4 days to run!
Could anyone offer any tips on how I might optimise the code?
Thanks.
#Importing required libraries
try:
    import matplotlib.pyplot as plt
except:
    raise
import networkx as nx
import csv
import random
import math
def prod(iterable):
    p= 1
    for n in iterable:
        p *= n
    return p

def Sum(iterable):
    p= 0
    for n in iterable:
        p += n[3]
    return p

def CalcTriads(n):
    firstgen=G.neighbors(n)
    Edges=[]
    Triads=[]
    for i in firstgen:
        Edges.append(G.edges(i))
    for i in xrange(len(Edges)):
        for j in range(len(Edges[i])):# For node n go through the list of edges (j) for the neighboring nodes (i)
            if set([Edges[i][j][1]]).issubset(firstgen):# If the second node on the edge is also a neighbor of n (its in firstgen) then keep the edge.
                t=[n,Edges[i][j][0],Edges[i][j][1]]
                t.sort()
                Triads.append(t)# Add found nodes to Triads.
    new_Triads = []# Delete duplicate triads.
    for elem in Triads:
        if elem not in new_Triads:
            new_Triads.append(elem)
    Triads = new_Triads
    for i in xrange(len(Triads)):# Go through list of all Triads finding the weights of their edges using G[node1][node2]. Multiply the three weights and append value to each triad.
        a=G[Triads[i][0]][Triads[i][1]].values()
        b=G[Triads[i][1]][Triads[i][2]].values()
        c=G[Triads[i][2]][Triads[i][0]].values()
        Q=prod(a+b+c)
        Triads[i].append(Q)
    return Triads
###### Import sorted edge data ######
li=[]
with open('Sorted Data.csv', 'rU') as f:
    reader = csv.reader(f)
    for row in reader:
        li.append([float(row[0]),float(row[1]),float(row[2])])
G=nx.Graph()
G.add_weighted_edges_from(li)
for i in xrange(800000):
    e = random.choice(li) # Choose random edge
    TriNei=[]
    a=CalcTriads(e[0]) # Find triads of first node in the chosen edge
    for i in xrange(0,len(a)):
        if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
            TriNei.append(a[i])
    preH=-Sum(TriNei) # Save the "energy" of all the triads of which the edge is a member
    e[2]=-1*e[2]# Flip the weight of the random edge and create a new graph with the flipped edge
    G.clear()
    G.add_weighted_edges_from(li)
    TriNei=[]
    a=CalcTriads(e[0])
    for i in xrange(0,len(a)):
        if set([e[1]]).issubset(a[i]):
            TriNei.append(a[i])
    postH=-Sum(TriNei)# Calculate the post flip "energy".
    if postH<preH:# If the post flip energy is lower then the pre flip energy keep the change
        continue
    elif random.random() < 0.92: # If the post flip energy is higher then only keep the change with some small probability. (0.92 is an approximate placeholder for exp(-DeltaH)/exp(1) at the moment)
        e[2]=-1*e[2]
The following suggestions won't boost your performance that much because they are not on the algorithmic level, i.e. not very specific to your problem. However, they are generic suggestions for slight performance improvements:
Unless you are using Python 3, change
for i in range(800000):
to
for i in xrange(800000):
The latter just iterates over the numbers from 0 to 799999, whereas the former creates a huge list of numbers and then iterates over that list. Do the same for the other loops that use range.
Also, change
j=random.choice(range(len(li)))
e=li[j] # Choose random edge
to
e = random.choice(li)
and use e instead of li[j] subsequently. If you really need an index number, use random.randint(0, len(li)-1).
There are syntactic changes you can make to speed things up, such as replacing your Sum and prod functions with the built-in equivalents sum(x[3] for x in iterable) and reduce(operator.mul, iterable). It is generally faster to use built-in functions or generator expressions than explicit loops.
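As a small illustration of those two replacements, here is a sketch with made-up data (the triads list is an assumption, following the question's layout of three node ids followed by the product of the edge weights). It imports reduce from functools, which works on Python 2.6+ and is required on Python 3.

```python
import operator
from functools import reduce

# Each triad: three node ids followed by the product of its edge weights.
triads = [[1, 2, 3, 1], [1, 2, 4, -1], [2, 3, 4, -1]]

# Replacement for Sum(): add up the fourth element of every triad.
total = sum(t[3] for t in triads)

# Replacement for prod(): multiply a sequence of edge weights together.
weights = [1, -1, -1]
product = reduce(operator.mul, weights, 1)

print(total, product)
```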
As far as I can tell the line:
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
is testing if a float is in a list of floats. Replacing it with if e[1] in a[i]: will remove the overhead of creating two set objects for each comparison.
Incidentally, you do not need to loop through the index values of an array, if you are only going to use that index to access the elements. e.g. replace
for i in range(0,len(a)):
    if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
        TriNei.append(a[i])
with
for x in a:
    if set([e[1]]).issubset(x): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
        TriNei.append(x)
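Combining this with the membership test suggested earlier, the whole loop could collapse into a single list comprehension. The data below is invented for illustration and follows the question's layout (an edge is [node, node, weight]; a triad is three nodes plus the weight product).

```python
e = [1.0, 2.0, -1.0]                 # the chosen edge
a = [[1.0, 2.0, 3.0, 1.0],           # triads around node e[0]
     [1.0, 4.0, 5.0, -1.0],
     [1.0, 2.0, 6.0, -1.0]]

# Keep the triads that also contain the second node of the edge.
TriNei = [x for x in a if e[1] in x]
print(TriNei)
```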
However, I suspect that changes like this will not make a big difference to the overall runtime. To do that, you either need to use a different algorithm or switch to a faster language. You could try running it in PyPy, which for some cases can be significantly faster than CPython. You could also try Cython, which will compile your code to C and can sometimes give a big performance gain, especially if you annotate your code with Cython type information. I think the biggest improvement may come from changing the algorithm to one that does less work, but I don't have any suggestions for that.
BTW, why loop 800000 times? What is the significance of that number?
Also, please use meaningful names for your variables. Using single character names or shrtAbbrv does not speed the code up at all, and makes it very hard to follow what it is doing.
There are quite a few things you can improve here. Start by profiling your program using a tool like cProfile. This will tell you where most of the program's time is being spent and thus where optimization is likely to be most helpful. As a hint, you don't need to generate all the triads at every iteration of the program.
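A minimal way to run cProfile from within a script is sketched below, written for Python 3. The functions slow_step and main are placeholders standing in for the real simulation; only the profiling calls are the point.

```python
import cProfile
import io
import pstats

def slow_step(n):
    # Placeholder for an expensive inner routine such as the triad search.
    return sum(i * i for i in range(n))

def main():
    for _ in range(50):
        slow_step(10000)

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

For a quick look without changing the code, running the script as python -m cProfile -s cumulative followed by the script name gives a similar report.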
You also need to fix your indentation before you can expect a decent answer.
Regardless, this question might be better suited to Code Review.
I'm not sure I understand exactly what you are aiming for, but there are at least two changes that might help. You probably don't need to destroy and create the graph every time in the loop since all you are doing is flipping one edge weight sign. And the computation to find the triangles can be improved.
Here is some code that generates a complete graph with random weights, picks a random edge in a loop, finds the triads and flips the edge weight...
import random
import networkx as nx

# complete graph with random 1/-1 as weight
G=nx.complete_graph(5)
for u,v,d in G.edges(data=True):
    d['weight']=random.randrange(-1,2,2) # -1 or 1

edges=list(G.edges()) # a list, so that random.choice can index it
for i in range(10):
    u,v = random.choice(edges) # random edge
    nbrs = set(G[u]) & set(G[v]) - set([u,v]) # nodes in triads
    triads = [(u,v,n) for n in nbrs]
    print("triads",triads)
    for u,v,w in triads:
        print((u,v,G[u][v]['weight']),(u,w,G[u][w]['weight']),(v,w,G[v][w]['weight']))
    G[u][v]['weight']*=-1
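Building on the same idea, the energy change caused by a flip can be computed from the common neighbours alone, without rebuilding the graph or recomputing every triad. The sketch below assumes the energy is H = -(sum of the triad products), as in the question; the function names and the temperature parameter are hypothetical, with exp(-delta / temperature) standing in for the question's placeholder acceptance probability.

```python
import math
import random
import networkx as nx

def flip_delta(G, u, v):
    """Energy change from flipping edge (u, v), with H = -(sum of triad products)."""
    common = set(G[u]) & set(G[v])   # third nodes of the triads containing (u, v)
    s = sum(G[u][n]["weight"] * G[v][n]["weight"] for n in common)
    # Each affected triad product changes sign, so H changes by 2 * w_uv * s.
    return 2 * G[u][v]["weight"] * s

def metropolis_step(G, edges, temperature=1.0):
    """Pick a random edge and flip it if the Metropolis rule accepts the move."""
    u, v = random.choice(edges)
    delta = flip_delta(G, u, v)
    if delta <= 0 or random.random() < math.exp(-delta / temperature):
        G[u][v]["weight"] *= -1
        return delta
    return 0

G = nx.complete_graph(5)
for a, b in G.edges():
    G[a][b]["weight"] = random.choice([-1, 1])
edges = list(G.edges())
for _ in range(10):
    metropolis_step(G, edges)
```

Each step then costs time proportional to the degrees of the two endpoints, instead of a full graph rebuild plus a triad search.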