I am writing a piece of code which models the evolution of a social network. The idea is that each person is assigned to a node and relationships between people (edges on the network) are given a weight of +1 or -1 depending on whether the relationship is friendly or unfriendly.
Using this simple model you can say that a triad of three people is either "balanced" or "unbalanced" depending on whether the product of the edges of the triad is positive or negative.
What I am ultimately trying to do is implement an Ising-type model: random edges are flipped, and the new relationship is kept if the new network has more balanced triangles (a lower energy) than the network before the flip. If that is not the case, the new relationship is only kept with a certain probability.
Now to my question: I have written the following code, but my dataset contains ~120k triads, so it will take 4 days to run!
Could anyone offer any tips on how I might optimise the code?
Thanks.
# Importing required libraries
try:
import matplotlib.pyplot as plt
except:
raise
import networkx as nx
import csv
import random
import math
def prod(iterable):
p= 1
for n in iterable:
p *= n
return p
def Sum(iterable):
p= 0
for n in iterable:
p += n[3]
return p
def CalcTriads(n):
firstgen=G.neighbors(n)
Edges=[]
Triads=[]
for i in firstgen:
Edges.append(G.edges(i))
for i in xrange(len(Edges)):
for j in range(len(Edges[i])):# For node n go through the list of edges (j) for the neighboring nodes (i)
if set([Edges[i][j][1]]).issubset(firstgen):# If the second node on the edge is also a neighbor of n (its in firstgen) then keep the edge.
t=[n,Edges[i][j][0],Edges[i][j][1]]
t.sort()
Triads.append(t)# Add found nodes to Triads.
new_Triads = []# Delete duplicate triads.
for elem in Triads:
if elem not in new_Triads:
new_Triads.append(elem)
Triads = new_Triads
for i in xrange(len(Triads)):# Go through list of all Triads finding the weights of their edges using G[node1][node2]. Multiply the three weights and append value to each triad.
a=G[Triads[i][0]][Triads[i][1]].values()
b=G[Triads[i][1]][Triads[i][2]].values()
c=G[Triads[i][2]][Triads[i][0]].values()
Q=prod(a+b+c)
Triads[i].append(Q)
return Triads
###### Import sorted edge data ######
li=[]
with open('Sorted Data.csv', 'rU') as f:
reader = csv.reader(f)
for row in reader:
li.append([float(row[0]),float(row[1]),float(row[2])])
G=nx.Graph()
G.add_weighted_edges_from(li)
for i in xrange(800000):
e = random.choice(li) # Choose random edge
TriNei=[]
a=CalcTriads(e[0]) # Find triads of first node in the chosen edge
for i in xrange(0,len(a)):
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
TriNei.append(a[i])
preH=-Sum(TriNei) # Save the "energy" of all the triads of which the edge is a member
e[2]=-1*e[2]# Flip the weight of the random edge and create a new graph with the flipped edge
G.clear()
G.add_weighted_edges_from(li)
TriNei=[]
a=CalcTriads(e[0])
for i in xrange(0,len(a)):
if set([e[1]]).issubset(a[i]):
TriNei.append(a[i])
postH=-Sum(TriNei)# Calculate the post flip "energy".
if postH<preH:# If the post flip energy is lower then the pre flip energy keep the change
continue
elif random.random() < 0.92: # If the post flip energy is higher then only keep the change with some small probability. (0.92 is an approximate placeholder for exp(-DeltaH)/exp(1) at the moment)
e[2]=-1*e[2]
The following suggestions won't boost your performance that much because they are not on the algorithmic level, i.e. not very specific to your problem. However, they are generic suggestions for slight performance improvements:
Unless you are using Python 3, change
for i in range(800000):
to
for i in xrange(800000):
The latter one just iterates numbers from 0 to 800000, the first one creates a huge list of numbers and then iterates that list. Do something similar for the other loops using range.
Also, change
j=random.choice(range(len(li)))
e=li[j] # Choose random edge
to
e = random.choice(li)
and use e instead of li[j] subsequently. If you really need an index number, use random.randint(0, len(li)-1).
There are syntactic changes you can make to speed things up, such as replacing your Sum and prod functions with the built-in equivalents sum(x[3] for x in iterable) and reduce(operator.mul, iterable) (after import operator; in Python 3, reduce lives in functools) - it is generally faster to use builtin functions or generator expressions than explicit loops.
As far as I can tell the line:
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
is testing if a float is in a list of floats. Replacing it with if e[1] in a[i]: will remove the overhead of creating two set objects for each comparison.
Incidentally, you do not need to loop through the index values of an array, if you are only going to use that index to access the elements. e.g. replace
for i in range(0,len(a)):
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
TriNei.append(a[i])
with
for x in a:
if set([e[1]]).issubset(x): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
TriNei.append(x)
However I suspect that changes like this will not make a big difference to the overall runtime. To do that you either need to use a different algorithm or switch to a faster language. You could try running it in pypy - for some cases it can be significantly faster than CPython. You could also try cython, which will compile your code to C and can sometimes give a big performance gain especially if you annotate your code with cython type information. I think the biggest improvement may come from changing the algorithm to one that does less work, but I don't have any suggestions for that.
BTW, why loop 800000 times? What is the significance of that number?
Also, please use meaningful names for your variables. Using single character names or shrtAbbrv does not speed the code up at all, and makes it very hard to follow what it is doing.
There are quite a few things you can improve here. Start by profiling your program using a tool like cProfile. This will tell you where most of the program's time is being spent and thus where optimization is likely to be most helpful. As a hint, you don't need to generate all the triads at every iteration of the program.
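As a rough illustration of the profiling step, here is a minimal sketch using cProfile and pstats from the standard library. It is written for Python 3, and slow_step and simulation are hypothetical stand-ins for the real model code, not functions from the question.

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Hypothetical stand-in for the expensive per-iteration work.
    return sum(i * i for i in range(n))


def simulation(iterations=200):
    # Hypothetical stand-in for the main loop of the model.
    return [slow_step(1000) for _ in range(iterations)]


profiler = cProfile.Profile()
profiler.enable()
simulation()
profiler.disable()

# Sort by cumulative time and report the ten most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

For a whole script, running python -m cProfile -s cumulative yourscript.py gives a similar report without changing the code.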
You also need to fix your indentation before you can expect a decent answer.
Regardless, this question might be better suited to Code Review.
I'm not sure I understand exactly what you are aiming for, but there are at least two changes that might help. You probably don't need to destroy and create the graph every time in the loop since all you are doing is flipping one edge weight sign. And the computation to find the triangles can be improved.
Here is some code that generates a complete graph with random weights, picks a random edge in a loop, finds the triads and flips the edge weight...
import random
import networkx as nx
# complete graph with random 1/-1 as weight
G=nx.complete_graph(5)
for u,v,d in G.edges(data=True):
    d['weight']=random.randrange(-1,2,2) # -1 or 1
edges=list(G.edges()) # list() so random.choice can index it
for i in range(10):
    u,v = random.choice(edges) # random edge
    nbrs = set(G[u]) & set(G[v]) - set([u,v]) # nodes in triads
    triads = [(u,v,n) for n in nbrs]
    print "triads",triads
    for u,v,w in triads:
        print (u,v,G[u][v]['weight']),(u,w,G[u][w]['weight']),(v,w,G[v][w]['weight'])
    G[u][v]['weight']*=-1
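Building on the idea above, the energy change from flipping one edge can be computed from the common neighbours of its endpoints alone, without rebuilding the graph. The following is a minimal Python 3 sketch under some assumptions: the function names edge_energy and try_flip are made up for illustration, the energy of an edge is taken as minus the sum of the triad products it takes part in (as in the question), and accept_prob is the probability of keeping a non-improving flip (0.08 below simply mirrors the question's 0.92 placeholder for reverting).

```python
import random
import networkx as nx


def edge_energy(G, u, v):
    # Energy of the triads containing edge (u, v): minus the sum, over the
    # common neighbours w, of the product of the three edge weights.
    common = set(G[u]) & set(G[v])
    return -sum(
        G[u][v]["weight"] * G[u][w]["weight"] * G[v][w]["weight"]
        for w in common
    )


def try_flip(G, u, v, accept_prob, rng=random):
    # Flipping (u, v) negates every product above, so the energy after the
    # flip is minus the energy before it: delta = -2 * (energy before).
    delta = -2 * edge_energy(G, u, v)
    if delta < 0 or rng.random() < accept_prob:
        G[u][v]["weight"] *= -1  # flip in place, no graph rebuild
        return True
    return False


# Small usage example on a random signed complete graph.
G = nx.complete_graph(5)
for _, _, d in G.edges(data=True):
    d["weight"] = random.choice([-1, 1])
edges = list(G.edges())
for _ in range(10):
    u, v = random.choice(edges)
    try_flip(G, u, v, accept_prob=0.08)
```

Each step then costs roughly the size of the two neighbourhoods rather than a pass over the whole graph.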
I feel this should be simple but I'm stuck on finding a neat solution. The code I have provided works, and gives the output I expect, but I don't feel it is Pythonic and it's getting on my nerves.
I have produced three sets of coordinates, X, Y and Z, using 'griddata' from a base data set. The coordinates are evenly spaced over an unknown total area / shape (not necessarily a square or rectangle), which produces NaN results at the boundaries of each list; I want to ignore these. The list should be traversed from the 'bottom left' (in a coordinate system), left to right along the x axis, then up one step in the y direction and right to left, and so on. There could be an odd or even number of rows.
The operation to be performed on each point is the same no matter the direction, and it is guaranteed that for every point which exists in X, a point exists in Y and Z, as can be seen in the code below.
Arrays (lists?) are of the format DataPoint[rows][columns].
k = 0
for i in range(len(x)):
    if k % 2 == 0: # cut left to right, then right to left
        for j in range(len(x[i])):
            if not numpy.isnan(x[i][j]):
                file.write(f'X{x[i][j]} Y{y[i][j]} Z{z[i][j]}')
    else:
        for j in reversed(range(len(x[i]))):
            if not numpy.isnan(x[i][j]):
                file.write(f'X{x[i][j]} Y{y[i][j]} Z{z[i][j]}')
    k += 1
One solution I could think of would be to reverse every other row in each of the lists before running the loop. It would save me a few lines, but probably wouldn't make sense from a performance standpoint - anyone have any better suggestions?
Expected route through list:
End════<══════╗
╔══════>══════╝
╚══════<══════╗
Start══>══════╝
Here's a variant:
for i, (x_row, y_row, z_row) in enumerate(zip(x, y, z)):
    if i % 2:
        # odd rows are traversed right to left
        x_row = reversed(x_row)
        y_row = reversed(y_row)
        z_row = reversed(z_row)
    row_strs = list()
    for x_elem, y_elem, z_elem in zip(x_row, y_row, z_row):
        if not numpy.isnan(x_elem):
            row_strs.append(f"X{x_elem} Y{y_elem} Z{z_elem}")
    file.write("".join(row_strs))
Considerations:
There is no recipe for an optimization that will always perform better than any other. It also depends on the data that the code handles. Here's a list of things that I could think of, without knowing how the data looks like:
for index in range(len(sequence)): is not a Pythonic way of iterating. Here, the foreach idiom is used. If the index is required, [Python 3.Docs]: Built-in Functions - enumerate(iterable, start=0) could be used
This no longer applies because of the previous bullet, but reversed(range(n)) is same as range(n - 1, -1, -1). Don't know whether the latter is faster, but it looks like it would be
Iterate over multiple iterables at once, using [Python 3.Docs]: Built-in Functions - zip(*iterables)
Don't need k, already have i
In general when working with files, it's better to read / write fewer times bigger chunks of data than many times smaller chunks of data (files generally reside on disk and disk operations are slow). However, buffering occurs by default (at Python, OS levels), so this is no longer an issue, but still. But again as always, it's a trade-off between resources (time, memory, ...). I chose to write to file once per line (rather than once per element - as it was originally). Of course, there's the 3rd possibility of writing everything at once, but I imagined that for larger data sets, it won't be the best solution
Probably, some optimizations could also happen at NumPy level (as it would handle bulk data much faster than Python code (iterating) does), but I'm not an expert in that area, nor do I know how the data looks like
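To illustrate what a NumPy-level version might look like, here is a minimal sketch. It assumes x, y and z are rectangular 2D arrays of the same shape (as griddata would return); the function name serpentine_points is made up for this example.

```python
import numpy as np


def serpentine_points(x, y, z):
    # Stack the three grids so each cell holds an (x, y, z) triple.
    grid = np.stack(
        [np.asarray(x, dtype=float),
         np.asarray(y, dtype=float),
         np.asarray(z, dtype=float)],
        axis=-1,
    )
    # Reverse the column order of every second row (rows 1, 3, 5, ...).
    grid[1::2] = grid[1::2, ::-1].copy()
    # Flatten row by row and drop the points whose x value is NaN.
    points = grid.reshape(-1, 3)
    return points[~np.isnan(points[:, 0])]
```

The result can then be written in one pass, for example with for px, py, pz in serpentine_points(x, y, z): file.write(f'X{px} Y{py} Z{pz}').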
I agree with @Prune, your code looks readable and does what it should do. You could compress it a bit by precomputing the indices, like so (note that this starts from the top left):
import numpy as np
# generate some sample data
x = np.arange(100).reshape(10,10)
# precompute both directions (as lists, so they can be reused for every row)
fancyranges = (
    list(range(len(x[0,:]))),
    list(reversed(range(len(x[0,:]))))
)
for a in range(x.shape[0]):
    # call appropriate directions
    for b in fancyranges[a%2]:
        # do things
        print(x[a,b])
You can move the repeated code into a sub_func, so that further changes happen in one place:
def func():
    def sub_func():
        # repeated code; i and j are read from the enclosing func()
        if not numpy.isnan(x[i][j]):
            print(f'X{x[i][j]}...')
    k = 0
    for i in range(len(x)):
        if k % 2 == 0: # cut left to right, then right to left
            for j in range(len(x[i])):
                sub_func()
        else:
            for j in reversed(range(len(x[i]))):
                sub_func()
        k += 1
func()
I need to find the closest possible sentence.
I have an array of sentences and a user sentence, and I need to find the closest to the user's sentence element of the array.
I presented each sentence in the form of a vector using word2vec:
def get_avg_vector(word_list, model_w2v, size=500):
    sum_vec = np.zeros(shape = (1, size))
    count = 0
    for w in word_list:
        if w in model_w2v and w != '':
            sum_vec += model_w2v[w]
            count +=1
    if count == 0:
        return sum_vec
    else:
        return sum_vec / count + 1
As a result, the array element looks like this:
array([[ 0.93162371, 0.95618944, 0.98519795, 0.98580566, 0.96563747,
0.97070891, 0.99079191, 1.01572807, 1.00631016, 1.07349398,
1.02079309, 1.0064849 , 0.99179418, 1.02865136, 1.02610303,
1.02909719, 0.99350413, 0.97481178, 0.97980362, 0.98068508,
1.05657591, 0.97224562, 0.99778703, 0.97888296, 1.01650529,
1.0421448 , 0.98731804, 0.98349052, 0.93752996, 0.98205837,
1.05691232, 0.99914532, 1.02040555, 0.99427229, 1.01193818,
0.94922226, 0.9818139 , 1.03955 , 1.01252615, 1.01402485,
...
0.98990598, 0.99576604, 1.0903802 , 1.02493086, 0.97395976,
0.95563786, 1.00538653, 1.0036294 , 0.97220088, 1.04822631,
1.02806122, 0.95402776, 1.0048053 , 0.97677222, 0.97830801]])
I also represent the user's sentence as a vector, and I compute the closest element to it like this:
%%cython
from scipy.spatial.distance import euclidean
def compute_dist(v, list_sentences):
    dist_dict = {}
    for key, val in list_sentences.items():
        dist_dict[key] = euclidean(v, val)
    return sorted(dist_dict.items(), key=lambda x: x[1])[0][0]
list_sentences in the method above is a dictionary in which the keys are text representations of the sentences and the values are vectors.
It takes a very long time, because I have more than 60 million sentences.
How can I speed up, optimize this process?
I'll be grateful for any advice.
The initial calculation of the 60 million sentences' vectors is essentially a fixed cost you'll pay once. I'm assuming you mainly care about the time for each subsequent lookup, for a single user-supplied query sentence.
Using numpy native array operations can speed up the distance calculations over doing your own individual calculations in a Python loop. (It's able to do things in bulk using its optimized code.)
But first you'd want to replace list_sentences with a true numpy array, accessed only by array-index. (If you have other keys/texts you need to associate with each slot, you'd do that elsewhere, with some dict or list.)
Let's assume you've done that, in whatever way is natural for your data, and now have array_sentences, a 60-million by 500-dimension numpy array, with one sentence average vector per row.
Then a 1-liner way to get an array full of the distances is as the vector-length ("norm") of the difference between each of the 60 million candidates and the 1 query (which gives a 60-million entry answer with each of the differences):
dists = np.linalg.norm(array_sentences - v, axis=1)  # axis=1: one norm per row
Another 1-liner way is to use the SciPy utility function cdist() (from scipy.spatial.distance) for computing the distance between each pair of two collections of inputs. Here, your first collection is just the one query vector v (but if you had batches to do at once, supplying more than one query at a time could offer an additional slight speedup):
from scipy.spatial.distance import cdist
dists = cdist(np.atleast_2d(v), array_sentences)[0]
(Note that such vector comparisons often use cosine-distance/cosine-similarity rather than euclidean-distance. If you switch to that, you might be doing other norming/dot-products instead of the first option above, or use the metric='cosine' option to cdist().)
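As a rough sketch of the cosine variant using only NumPy: the function name cosine_distances is made up for this example, and it assumes that neither the query nor any row of array_sentences is an all-zero vector (otherwise the division is undefined).

```python
import numpy as np


def cosine_distances(v, array_sentences):
    # Flatten the query so that both (d,) and (1, d) shapes work.
    v = np.asarray(v, dtype=float).ravel()
    array_sentences = np.asarray(array_sentences, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(array_sentences, axis=1) * np.linalg.norm(v)
    similarities = array_sentences @ v / norms
    # Cosine distance is one minus the similarity.
    return 1.0 - similarities
```

If the rows are normalised to unit length once, ahead of time, each query reduces to a single matrix-vector product.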
Once you have all the distances in a numpy array, using a numpy-native sort option is likely to be faster than using Python sorted(). For example, numpy's indirect sort argsort(), which just returns the sorted indexes (and thus avoids moving all the vector coordinates-around), since you just want to know which items are the best match(es). For example:
sorted_indexes = np.argsort(dists)
best_index = sorted_indexes[0]
If you need to turn that int index back into your other key/text, you'd use your own dict/list that remembered the slot-to-key relationships.
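Putting these steps together, a brute-force lookup might look like the sketch below. The function name nearest_indexes is hypothetical. It uses np.argmin when only the single best match is needed, and np.argpartition to avoid fully sorting all distances when the k best are wanted.

```python
import numpy as np


def nearest_indexes(v, array_sentences, k=1):
    # Flatten the query so that both (d,) and (1, d) shapes work.
    v = np.asarray(v, dtype=float).ravel()
    array_sentences = np.asarray(array_sentences, dtype=float)
    # One Euclidean distance per candidate row.
    dists = np.linalg.norm(array_sentences - v, axis=1)
    if k == 1:
        return np.array([np.argmin(dists)])
    k = min(k, len(dists))
    # argpartition moves the k smallest distances to the front without
    # sorting the rest; only those k are then sorted.
    candidates = np.argpartition(dists, k - 1)[:k]
    return candidates[np.argsort(dists[candidates])]
```

Assuming the keys were saved in the same order as the rows (for example keys = list(list_sentences)), keys[nearest_indexes(v, array_sentences)[0]] recovers the sentence text. Note that array_sentences - v allocates a temporary array as large as the data, so for 60 million rows it may be necessary to process the array in chunks.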
All these still give an exactly right result, by comparing against all candidates, which (even when done optimally well) is still time-consuming.
There are ways to get faster results, based on pre-building indexes to the full set of candidates – but such indexes become very tricky in high-dimensional spaces (like your 500-dimensional space). They often trade off perfectly accurate results for faster results. (That is, what they return for 'closest 1' or 'closest N' will have some errors, but usually not be off by much.) For examples of such libraries, see Spotify's ANNOY or Facebook's FAISS.
At least if you are doing this procedure for multiple sentences, you could try using scipy.spatial.cKDTree (I don't know whether it pays for itself on a single query. Also 500 is quite high, I seem to remember KDTrees work better for not quite as many dimensions. You'll have to experiment).
Assuming you've put all your vectors (dict values) into one large numpy array:
>>> import numpy as np
>>> from scipy.spatial import cKDTree as KDTree
>>>
# 100,000 vectors (that's all my RAM can take)
>>> a = np.random.random((100000, 500))
>>>
>>> t = KDTree(a)
# create one new vector and find distance and index of closest
>>> t.query(np.random.random(500))
(8.20910072933986, 83407)
I can think about 2 possible ways of optimizing this process.
First, if your goal is only to get the closest vector (or sentence), you could get rid of the dist_dict dictionary and only keep in memory the closest sentence you have found so far. This way, you won't need to sort the complete (and presumably very large) list at the end; you just return the closest one.
def compute_dist(v, list_sentences):
    min_dist = float('inf')  # start above any real distance
    closest_sentence = None
    for key, val in list_sentences.items():
        dist = euclidean(v, val)
        if dist < min_dist:
            closest_sentence = key
            min_dist = dist
    return closest_sentence
The second one is a little more speculative. You can try to reimplement the euclidean method, giving it a third argument: the current minimum distance min_dist between the closest vector you have found so far and the user vector. I don't know how the scipy euclidean method is implemented, but I guess it is close to summing squared differences along all the vector's dimensions. What you want is for the method to stop as soon as the sum is higher than min_dist (the distance will be higher than min_dist anyway and you won't keep it).
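A minimal sketch of that early-exit idea follows. The function name closest_with_early_exit and the chunk parameter are made up for this example. It compares squared distances (so no square root is needed) and checks the running sum after every chunk of dimensions. In plain Python this is unlikely to beat a vectorised NumPy computation; it is meant only to show the logic.

```python
import numpy as np


def closest_with_early_exit(v, list_sentences, chunk=50):
    v = np.asarray(v, dtype=float).ravel()
    best_key, best_sq = None, float("inf")
    for key, val in list_sentences.items():
        val = np.asarray(val, dtype=float).ravel()
        partial = 0.0
        for start in range(0, len(v), chunk):
            diff = v[start:start + chunk] - val[start:start + chunk]
            partial += float(diff @ diff)
            if partial >= best_sq:
                # Already no better than the best candidate: give up early.
                break
        else:
            # The loop finished without breaking, so this is the new best.
            best_key, best_sq = key, partial
    return best_key
```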
I carefully read the docs, but it still is unclear to me how to use G.forEdges(), described as an "experimental edge iterator interface".
Let's say that I want to decrease the density of my graph. I have a sorted list of weights, and I want to remove edges based on their weight until the graph splits into two connected components. Then I'll select the minimum number of links that keeps the graph connected. I would do something like this:
cc = components.ConnectedComponents(G).run()
while cc.numberOfComponents()==1:
    for weight in weightlist:
        for (u,v) in G.edges():
            if G.weight(u,v)==weight:
                G=G.removeEdge(u,v)
By the way I know from the docs that there is this edge iterator, which probably does the iteration in a more efficient way. But from the docs I really can't understand how to correctly use this forEdges, and I can't find a single example over the internet. Any ideas?
Or maybe an alternative idea to do what I want to do: since it's a huge graph (125millions links) the iteration will take forever, even if I am working on a cluster.
NetworKit iterators accept a callback function so if you want to iterate over edges (or nodes) you have to define a function and then pass it to the iterator as a parameter. You can find more information here. For example a simple function that just prints all edges is:
# Callback function.
# To iterate over edges it must accept 4 parameters
def myFunction(u, v, weight, edgeId):
    print("Edge from {} to {} has weight {} and id {}".format(u, v, weight, edgeId))
# Using iterator with callback function
G.forEdges(myFunction)
Now if you want to keep removing edges whose weight is inside your weightlist until the graph splits into two connected components you also have to update the connected components of the graph since ConnectedComponents will not do that for you automatically (this may be also one of the reasons why the iteration takes forever). To do this efficiently, you can use the DynConnectedComponents class (see my example below). In this case, I think that the edge iterator will not help you much so I would suggest you to keep using the for loop.
from networkit import *
# Efficiently updates connected components after edge updates
cc = components.DynConnectedComponents(G).run()
# Removes edges with the given weight until components split
def removeEdges(weight):
    for (u, v) in G.edges():
        if G.weight(u, v) == weight:
            G.removeEdge(u, v)
            # Updating connected components
            event = dynamic.GraphEvent(dynamic.GraphEvent.EDGE_REMOVAL, u, v, weight)
            cc.update(event)
            if cc.numberOfComponents() > 1:
                # Components did split
                return True
    # Components did not split
    return False
if cc.numberOfComponents() == 1:
    for weight in weightlist:
        if removeEdges(weight):
            break
This should speed up your original code a bit. However, it is still sequential code, so even if you run it on a multi-core machine it will use only one core.
I have a large network to analyze. For example:
import networkx as nx
import random
BA = nx.random_graphs.barabasi_albert_graph(1000000, 3)
nx.info(BA)
I have to shuffle the edges while keeping the degree distribution unchanged. The basic idea was introduced by Maslov. Thus, my colleague and I wrote a shuffleNetwork function which works on a network object G for Num iterations. edges is a list object.
The problem is this function runs too slow for large networks. I tried to use set or dict instead of list for the edges object (set and dict are hash table). However, since we also need to delete and add elements to it, the time complexity becomes even bigger.
Do you have any suggestions on further optimising this function?
def shuffleNetwork(G,Num):
    edges=G.edges()
    l=range(len(edges))
    for n in range(Num):
        i,j = random.sample(l, 2)
        a,b=edges[i]
        c,d=edges[j]
        if a != d and c!= b:
            # swap only if neither new edge already exists (in either direction)
            if not ((a,d) in edges or (d, a) in edges or (c,b) in edges or (b, c) in edges):
                edges[i]=(a,d)
                edges[j]=(c,b)
    K=nx.from_edgelist(edges)
    return K
import timeit
start = timeit.default_timer()
#Your statements here
gr = shuffleNetwork(BA, 1000)
stop = timeit.default_timer()
print stop - start
You should consider using nx.double_edge_swap
The documentation is here. It looks like it does exactly what you want, but modifies the graph in place.
I'm not sure whether it will solve the speed issues, but it does avoid generating the list, so I think it will do better than what you've got.
You would call it with nx.double_edge_swap(G,nswap=number)
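A small sketch of how that call could be checked on a test graph follows. The graph size, the seed values and max_tries are arbitrary choices for illustration. The swap works in place, and the degree of every node should be unchanged afterwards.

```python
import networkx as nx

# Smaller graph than in the question so the example runs quickly.
G = nx.barabasi_albert_graph(1000, 3, seed=1)
degrees_before = dict(G.degree())
edges_before = G.number_of_edges()

# Performs 100 successful swaps in place; max_tries bounds the attempts
# and an exception is raised if it is exceeded.
nx.double_edge_swap(G, nswap=100, max_tries=10000, seed=1)

degrees_after = dict(G.degree())
print(degrees_before == degrees_after)
```

If the original graph has to be kept, calling the function on G.copy() avoids modifying it.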
I want to build an algorithm in python to flip linestrings (arrays of coordinates) in a linestring collection which represent segments along a road, so that I can merge all coordinates into a single array where the coordinates are rising monotonic.
So my Segmentcollection looks something like this:
segmentCollection = [['1,1', '1,3', '2,3'],
['4,3', '2,3'],
['4,3', '7,10', '5,5']]
EDIT: So the structure is a list of lists of 2D Cartesian coordinate tuples ('1,1' for example is a point at x=1 and y=1, '7,10' is a point at x=7 and y=10, and so on). The whole problem is to merge all these lists into one list of coordinate tuples which are ordered in the sense of following a road in one direction. In fact these are segments which I get from a road network routing service, but I only get segments, where each segment is directed the way it is digitized in the database, not in the direction you have to drive. I would like to get a single polyline for the navigation route out of it.
So:
- I can assume, that all segments are in the right order
- I cannot assume that the Coordinates of each segment are in the right order
- Therefore I also cannot assume that the first coordinate of the first segment is the beginning
- And I also cannot assume that the last coordinate of the last segment is the end
- (EDIT) Even though I know where the start and end points of my navigation request are located, these do not have to be identical to one of the coordinate tuples in these lists, because they only have to be somewhere near a routing graph element.
The algorithm should iterate through every segment, flip it if necessary, and then append it to the resulting array. For the first segment, the challenge is to find the starting point (the point which is NOT connected to the next segment). All other segments are then connected with one point to the last segment in the order (a directed graph).
I'd wonder if there isn't some kind of sorting data structure (sorting tree or anything) which does exactly that. Could you please give some ideas? After messing around a while with loops and array comparisons my brain is knocked out, and I just need a kick into the right direction in the true sense of the word.
If I understand correctly, you don't even need to sort things. I just translated your English text into Python:
def joinSegments( s ):
    if s[0][0] == s[1][0] or s[0][0] == s[1][-1]:
        s[0].reverse()
    c = s[0][:]
    for x in s[1:]:
        if x[-1] == c[-1]:
            x.reverse()
        c += x
    return c
It still contains duplicate points, but removing those should be straightforward.
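One possible way to drop those duplicates, as a minimal sketch: the helper name drop_consecutive_duplicates is made up, and it only removes points that repeat directly after each other, which is where the joins put them.

```python
def drop_consecutive_duplicates(points):
    result = []
    for p in points:
        # Keep a point only if it differs from the one just before it.
        if not result or result[-1] != p:
            result.append(p)
    return result


# The joined list for the example segmentCollection from the question.
joined = ['1,1', '1,3', '2,3', '2,3', '4,3', '4,3', '7,10', '5,5']
print(drop_consecutive_duplicates(joined))
```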
def merge_seg(s):
    index_i = 0
    while index_i+1<len(s):
        index_j=index_i+1
        while index_j<len(s):
            if s[index_i][-1] == s[index_j][0]:
                s[index_i].extend(s[index_j][1:])
                del s[index_j]
            elif s[index_i][-1] == s[index_j][-1]:
                # list.reverse() returns None, so slice a reversed copy instead
                s[index_i].extend(s[index_j][::-1][1:])
                del s[index_j]
            else:
                index_j+=1
        index_i+=1
    result = []
    s.reverse()
    for seg_index in range(len(s)-1):
        result+=s[seg_index][:-1]#use [:-1] to delete the duplicate items
    result+=s[-1]
    return result
In the inner while loop, every successive segment of s[index_i] is appended to s[index_i];
index_i is then incremented until every segment is processed.
It is therefore easy to prove that after these while loops s[0][0] == s[1][-1], s[1][0] == s[2][-1], etc., so just reverse the list and put the segments together, and you will get your result.
Note: this is the simplest and most straightforward way, but not the most time-efficient.
For more algorithms see: http://en.wikipedia.org/wiki/Sorting_algorithm
You say that you can assume that all segments are in the right order, which means that independently of the coordinates order, your problem is basically to merge sorted arrays.
You would have to flip a segment if it's not defined in the right order, but this doesn't have a single impact on the main algorithm.
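Simply define this reordering function:
def reorder(seg):
    s1 = min(seg)
    e1 = max(seg)
    return (s1, e1)
and this comparison function:
def cmp_segments(seg1, seg2):
    # named cmp_segments so it does not shadow (and recurse into) the Python 2 built-in cmp()
    return cmp(reorder(seg1), reorder(seg2))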
and you are all set, just run a typical merge algorithm:
http://en.wikipedia.org/wiki/Merge_algorithm
And in case I didn't really understand your problem statement, here's another idea:
Use a segment tree which is a structure that is made exactly to store segments :)