I have a large network to analyze. For example:
import networkx as nx
import random
BA = nx.random_graphs.barabasi_albert_graph(1000000, 3)
nx.info(BA)
I have to shuffle the edges while keeping the degree distribution unchanged. The basic idea was introduced by Maslov. My colleague and I therefore wrote a shuffleNetwork function that works on a network object G and attempts Num swaps; edges is a list object.
The problem is that this function runs too slowly for large networks. I tried using a set or dict instead of a list for the edges object (sets and dicts are hash tables). However, since we also need to delete and add elements to it, the overall running time became even worse.
Do you have any suggestions on further optimising this function?
def shuffleNetwork(G, Num):
    edges = G.edges()
    l = range(len(edges))
    for n in range(Num):
        i, j = random.sample(l, 2)
        a, b = edges[i]
        c, d = edges[j]
        if a != d and c != b:
            if not ((a, d) in edges or (d, a) in edges or (c, b) in edges or (b, c) in edges):
                edges[i] = (a, d)
                edges[j] = (c, b)
    K = nx.from_edgelist(edges)
    return K
import timeit
start = timeit.default_timer()
#Your statements here
gr = shuffleNetwork(BA, 1000)
stop = timeit.default_timer()
print stop - start
You should consider using nx.double_edge_swap.
Its documentation describes exactly what you want, but note that it modifies the graph in place.
I'm not sure whether it will solve the speed issues, but it does avoid generating the list, so I think it will do better than what you've got.
You would call it with nx.double_edge_swap(G,nswap=number)
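For the BA graph built in the question, a minimal sketch might look like this (the nswap and max_tries values are only illustrative, not tuned):
import networkx as nx

# Perform 1000 degree-preserving double-edge swaps in place;
# max_tries bounds the number of attempts before giving up with an error.
nx.double_edge_swap(BA, nswap=1000, max_tries=10000)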
I have a large list (1e8+ entries) in the form [index:boolean]. I want to find the indices of the values that are True. Performance in this task is of the utmost importance.
Currently from what I can tell in Python 3.7.2, the optimal way of doing this is using a list comprehension as follows:
return [i for i, j in enumerate(numbers) if j]
I have also tried the following (although it seems to simply be the preferred method from earlier versions of Python):
return list(filter(lambda a: a, numbers))
The second approach is about 25% slower than the first approach.
Currently, this operation takes about 0.8*x time, where the actual logic part of my algorithm takes x time. (For example, if the logic takes 10 seconds, extracting the True values from the list takes about 8 seconds.) I had hoped this operation would be much faster.
Performance in this task is of the utmost importance
Then you should consider using a numpy array:
import numpy as np
from random import choice
from timeit import Timer
bools = True, False
li = [choice(bools) for _ in range(int(1e8))]
arr = np.array(li)
print(Timer(lambda: np.nonzero(arr)).repeat(1, 1))
Outputs
[0.4524359999999916]
That's about 0.45 seconds.
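Note that np.nonzero returns a tuple of index arrays (one per dimension), so for the 1-D arr above you can take the first element, or use np.flatnonzero to get the index array directly:
indices = np.nonzero(arr)[0]   # index array for the single dimension
# or, equivalently, for the flattened array:
indices = np.flatnonzero(arr)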
I was unable to find anything describing how to do this, which leads me to believe I'm not doing this in the proper idiomatic Python way. Advice on the 'proper' Python way to do this would also be appreciated.
I have a bunch of variables for a datalogger I'm writing (arbitrary logging length, with a known maximum length). In MATLAB, I would initialize them all as 1-D arrays of zeros of length n, n bigger than the number of entries I would ever see, assign each individual element variable(measurement_no) = data_point in the logging loop, and trim off the extraneous zeros when the measurement was over. The initialization would look like this:
[dData gData cTotalEnergy cResFinal etc] = deal(zeros(n,1));
Is there a way to do this in Python/NumPy so I don't either have to put each variable on its own line:
dData = np.zeros(n)
gData = np.zeros(n)
etc.
I would also prefer not just make one big matrix, because keeping track of which column is which variable is unpleasant. Perhaps the solution is to make the (length x numvars) matrix, and assign the column slices out to individual variables?
EDIT: Assume I'm going to have a lot of vectors of the same length by the time this is over; e.g., my post-processing takes each log file, calculates a bunch of separate metrics (>50), stores them, and repeats until the logs are all processed. Then I generate histograms, means/maxes/sigmas/etc. for all the various metrics I computed. Since initializing 50+ vectors is clearly not easy in Python, what's the best (cleanest code and decent performance) way of doing this?
If you're really motivated to do this in a one-liner you could create an (n_vars, ...) array of zeros, then unpack it along the first dimension:
a, b, c = np.zeros((3, 5))
print(a is b)
# False
Another option is to use a list comprehension or a generator expression:
a, b, c = [np.zeros(5) for _ in range(3)] # list comprehension
d, e, f = (np.zeros(5) for _ in range(3)) # generator expression
print(a is b, d is e)
# False False
Be careful, though! You might think that using the * operator on a list or tuple containing your call to np.zeros() would achieve the same thing, but it doesn't:
h, i, j = (np.zeros(5),) * 3
print(h is i)
# True
This is because the expression inside the tuple gets evaluated first. np.zeros(5) therefore only gets called once, and each element in the repeated tuple ends up being a reference to the same array. This is the same reason why you can't just use a = b = c = np.zeros(5).
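A quick check makes the aliasing visible:
h, i, j = (np.zeros(5),) * 3
h[0] = 1.0
print(i[0], j[0])
# 1.0 1.0 -- all three names point to the same array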
Unless you really need to assign a large number of empty array variables and you really care deeply about making your code compact (!), I would recommend initialising them on separate lines for readability.
Nothing wrong or un-Pythonic with
dData = np.zeros(n)
gData = np.zeros(n)
etc.
You could put them on one line, but there's no particular reason to do so.
dData, gData = np.zeros(n), np.zeros(n)
Don't try dData = gData = np.zeros(n), because a change to dData changes gData (they point to the same object). For the same reason you usually don't want to use x = y = [].
The deal in MATLAB is a convenience, but it isn't magical. Here's how Octave implements it:
function [varargout] = deal (varargin)
  if (nargin == 0)
    print_usage ();
  elseif (nargin == 1 || nargin == nargout)
    varargout(1:nargout) = varargin;
  else
    error ("deal: nargin > 1 and nargin != nargout");
  endif
endfunction
In contrast to Python, in Octave (and presumably MATLAB)
one=two=three=zeros(1,3)
assigns different objects to the 3 variables.
Notice also how MATLAB talks about deal as a way of assigning contents of cells and structure arrays. http://www.mathworks.com/company/newsletters/articles/whats-the-big-deal.html
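If you really want a deal-like one-liner in Python, a minimal analogue (a hypothetical helper, not a standard function) is just a factory called n times:
import numpy as np

def deal(factory, count):
    # Return `count` independent objects produced by calling `factory`.
    return tuple(factory() for _ in range(count))

dData, gData, cTotalEnergy = deal(lambda: np.zeros(100), 3)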
If you put your data in a collections.defaultdict you won't need to do any explicit initialization. Everything will be initialized the first time it is used.
import numpy as np
import collections
n = 100
data = collections.defaultdict(lambda: np.zeros(n))
for i in range(1, n):
    data['g'][i] = data['d'][i - 1]
    # ...
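When logging is done you can trim the arrays to the number of samples actually recorded, much like the MATLAB workflow described in the question (count is a hypothetical variable here):
count = 42  # however many samples were actually logged
trimmed = {name: arr[:count] for name, arr in data.items()}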
How about using map:
import numpy as np
n = 10 # Number of data points per array
m = 3 # Number of arrays being initialised
gData, pData, qData = map(np.zeros, [n] * m)
I carefully read the docs, but it still is unclear to me how to use G.forEdges(), described as an "experimental edge iterator interface".
Let's say that I want to decrease the density of my graph. I have a sorted list of weights, and I want to remove edges based on their weight until the graph splits into two connected components. Then I'll select the minimum number of links that keeps the graph connected. I would do something like this:
cc = components.ConnectedComponents(G).run()
while cc.numberOfComponents() == 1:
    for weight in weightlist:
        for (u, v) in G.edges():
            if G.weight(u, v) == weight:
                G.removeEdge(u, v)
By the way, I know from the docs that there is this edge iterator, which probably does the iteration in a more efficient way. But from the docs I really can't understand how to use this forEdges correctly, and I can't find a single example on the internet. Any ideas?
Or maybe you have an alternative idea for doing what I want: since it's a huge graph (125 million links), the iteration will take forever, even though I am working on a cluster.
NetworKit iterators accept a callback function, so if you want to iterate over edges (or nodes) you have to define a function and then pass it to the iterator as a parameter. You can find more information in the NetworKit documentation. For example, a simple function that just prints all edges is:
# Callback function.
# To iterate over edges it must accept 4 parameters
def myFunction(u, v, weight, edgeId):
    print("Edge from {} to {} has weight {} and id {}".format(u, v, weight, edgeId))

# Using iterator with callback function
G.forEdges(myFunction)
Now, if you want to keep removing edges whose weight is in your weightlist until the graph splits into two connected components, you also have to update the connected components of the graph: ConnectedComponents will not do that for you automatically (this may also be one of the reasons why the iteration takes forever). To do this efficiently, you can use the DynConnectedComponents class (see my example below). In this case, I think the edge iterator will not help you much, so I would suggest you keep using the for loop.
from networkit import *

# Efficiently updates connected components after edge updates
cc = components.DynConnectedComponents(G).run()

# Removes edges with weight equal to w until the components split
def removeEdges(w):
    for (u, v) in G.edges():
        if G.weight(u, v) == w:
            G.removeEdge(u, v)
            # Updating connected components
            event = dynamic.GraphEvent(dynamic.GraphEvent.EDGE_REMOVAL, u, v, w)
            cc.update(event)
            if cc.numberOfComponents() > 1:
                # Components did split
                return True
    # Components did not split
    return False

if cc.numberOfComponents() == 1:
    for weight in weightlist:
        if removeEdges(weight):
            break
This should speed up your original code a bit. However, it is still sequential code, so even if you run it on a multi-core machine it will use only one core.
I'm trying to implement an edge list for a MultiGraph in Python.
What I've tried so far:
>>> from collections import Counter
>>> l1 = Counter({(1, 2): 2, (1, 3): 1})
>>> l2 = [(1, 2), (1, 2), (1, 3)]
l1 has constant-time deletion of all edges between two vertices (e.g. del l1[(1, 2)]) but linear-time random selection on those edges (e.g. random.choice(list(l1.elements()))). Note that you have to do a selection on elements (vs. l1 itself).
l2 has constant-time random selection (random.choice(l2)) but linear-time deletion of all elements equal to a given edge ([i for i in l2 if i != (1, 2)]).
Question: is there a Python data structure that would give me both constant-time random selection and deletion?
I don't think what you're trying to do is achievable in theory.
If you're using weighted values to represent duplicates, you can't get constant-time random selection. The best you could possibly do is some kind of skip-list-type structure that lets you binary-search the element by weighted index, which is logarithmic.
If you're not using weighted values to represent duplicates, then you need some structure that allows you to store multiple copies. And a hash table isn't going to do it: the dups have to be independent objects (e.g., (edge, autoincrement)), meaning there's no way to delete all that match some criterion in constant time.
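To make the weighted case concrete, here is a sketch of selection by binary search over a cumulative-count table; picking is logarithmic in the number of distinct edges, but the table has to be rebuilt (or patched) after every deletion:
import bisect
import random
from collections import Counter

edges = Counter({(1, 2): 2, (1, 3): 1})
keys = list(edges)
cumulative = []
total = 0
for k in keys:
    total += edges[k]
    cumulative.append(total)

# Draw an edge with probability proportional to its multiplicity.
pick = bisect.bisect_right(cumulative, random.randrange(total))
print(keys[pick])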
If you can accept logarithmic time, the obvious choice is a tree. For example, using blist:
>>> import blist
>>> l3 = blist.sortedlist(l2)
To select one at random:
>>> edge = random.choice(l3)
The documentation doesn't seem to guarantee that this won't do something O(n). But fortunately, the source for both 3.3 and 2.7 shows that it's going to do the right thing. If you don't trust that, just write l3[random.randrange(len(l3))].
To delete all copies of an edge, you can do it like this:
>>> del l3[l3.bisect_left(edge):l3.bisect_right(edge)]
Or:
>>> try:
...     while True:
...         l3.remove(edge)
... except ValueError:
...     pass
The documentation explains the exact performance guarantees for every operation involved. In particular, len is constant, while indexing, slicing, deleting by index or slice, bisecting, and removing by value are all logarithmic, so both operations end up logarithmic.
(It's worth noting that blist is a B+Tree; you might get better performance out of a red-black tree, or a treap, or something else. You can find good implementations for most data structures on PyPI.)
As pointed out by senderle, if the maximum number of copies of an edge is much smaller than the size of the collection, you can create a data structure that does it in time quadratic in the maximum number of copies. Translating his suggestion into code:
import random
from collections import defaultdict

class MGraph(object):
    def __init__(self):
        self.edgelist = []
        self.edgedict = defaultdict(list)

    def add(self, edge):
        self.edgedict[edge].append(len(self.edgelist))
        self.edgelist.append(edge)

    def remove(self, edge):
        # Process stored positions from the back so the positions that are
        # still pending remain valid while we swap-with-last and shrink the list.
        for index in sorted(self.edgedict.get(edge, []), reverse=True):
            maxedge = len(self.edgelist) - 1
            lastedge = self.edgelist[maxedge]
            self.edgelist[index], self.edgelist[maxedge] = self.edgelist[maxedge], self.edgelist[index]
            self.edgedict[lastedge] = [i if i != maxedge else index for i in self.edgedict[lastedge]]
            del self.edgelist[-1]
        self.edgedict.pop(edge, None)

    def choice(self):
        return random.choice(self.edgelist)
(You could, of course, replace the update-the-list-with-a-list-comprehension line with a three-line find-and-update-in-place, but that's still linear in the number of dups.)
Obviously, if you plan to use this for real, you may want to beef up the class a bit. You can make it look like a list of edges, a set of tuples of multiple copies of each edge, a Counter, etc., by implementing a few methods and letting the appropriate collections.abc.Foo/collections.Foo fill in the rest.
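For instance, a minimal read-only sequence view over MGraph (the EdgeSequence name is made up for illustration): implement __getitem__ and __len__, and abc.Sequence supplies __contains__, __iter__, __reversed__, index() and count() for free:
from collections import abc  # on Python 2, these ABCs live directly in collections

class EdgeSequence(abc.Sequence):
    # Read-only sequence view over an MGraph's internal edge list.
    def __init__(self, mgraph):
        self._edges = mgraph.edgelist

    def __getitem__(self, index):
        return self._edges[index]

    def __len__(self):
        return len(self._edges)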
So, which is better? Well, in your sample case, the average dup count is half the size of the list, and the maximum is 2/3rds the size. If that were true for your real data, the tree would be much, much better, because log N will obviously blow away (N/2)**2. On the other hand, if dups were rare, senderle's solution would obviously be better, because W**2 is still 1 if W is 1.
Of course for a 3-element sample, constant overhead and multipliers are going to dominate everything. But presumably your real collection isn't that tiny. (If it is, just use a list...)
If you don't know how to characterize your real data, write both implementations and time them with various realistic inputs.
I am writing a piece of code which models the evolution of a social network. The idea is that each person is assigned to a node and relationships between people (edges on the network) are given a weight of +1 or -1 depending on whether the relationship is friendly or unfriendly.
Using this simple model you can say that a triad of three people is either "balanced" or "unbalanced" depending on whether the product of the edges of the triad is positive or negative.
So what I am finally trying to do is implement an Ising-type model: random edges are flipped, and the new relationship is kept if the new network has more balanced triangles (a lower energy) than the network before the flip; if that is not the case, the new relationship is only kept with a certain probability.
OK, so finally onto my question: I have written the following code; however, the dataset I have contains ~120k triads, and as a result it will take 4 days to run!
Could anyone offer any tips on how I might optimise the code?
Thanks.
# Importing required libraries
try:
    import matplotlib.pyplot as plt
except:
    raise
import networkx as nx
import csv
import random
import math

def prod(iterable):
    p = 1
    for n in iterable:
        p *= n
    return p

def Sum(iterable):
    p = 0
    for n in iterable:
        p += n[3]
    return p

def CalcTriads(n):
    firstgen = G.neighbors(n)
    Edges = []
    Triads = []
    for i in firstgen:
        Edges.append(G.edges(i))
    for i in xrange(len(Edges)):
        for j in range(len(Edges[i])):  # For node n go through the list of edges (j) for the neighboring nodes (i)
            if set([Edges[i][j][1]]).issubset(firstgen):  # If the second node on the edge is also a neighbor of n (it's in firstgen) then keep the edge.
                t = [n, Edges[i][j][0], Edges[i][j][1]]
                t.sort()
                Triads.append(t)  # Add found nodes to Triads.
    new_Triads = []  # Delete duplicate triads.
    for elem in Triads:
        if elem not in new_Triads:
            new_Triads.append(elem)
    Triads = new_Triads
    for i in xrange(len(Triads)):  # Go through the list of all Triads, finding the weights of their edges using G[node1][node2]. Multiply the three weights and append the value to each triad.
        a = G[Triads[i][0]][Triads[i][1]].values()
        b = G[Triads[i][1]][Triads[i][2]].values()
        c = G[Triads[i][2]][Triads[i][0]].values()
        Q = prod(a + b + c)
        Triads[i].append(Q)
    return Triads

###### Import sorted edge data ######
li = []
with open('Sorted Data.csv', 'rU') as f:
    reader = csv.reader(f)
    for row in reader:
        li.append([float(row[0]), float(row[1]), float(row[2])])

G = nx.Graph()
G.add_weighted_edges_from(li)

for i in xrange(800000):
    e = random.choice(li)  # Choose random edge
    TriNei = []
    a = CalcTriads(e[0])  # Find triads of first node in the chosen edge
    for i in xrange(0, len(a)):
        if set([e[1]]).issubset(a[i]):  # Keep triads which contain the whole edge (i.e. both nodes on the edge)
            TriNei.append(a[i])
    preH = -Sum(TriNei)  # Save the "energy" of all the triads of which the edge is a member

    e[2] = -1 * e[2]  # Flip the weight of the random edge and create a new graph with the flipped edge
    G.clear()
    G.add_weighted_edges_from(li)
    TriNei = []
    a = CalcTriads(e[0])
    for i in xrange(0, len(a)):
        if set([e[1]]).issubset(a[i]):
            TriNei.append(a[i])
    postH = -Sum(TriNei)  # Calculate the post flip "energy".

    if postH < preH:  # If the post flip energy is lower than the pre flip energy, keep the change
        continue
    elif random.random() < 0.92:  # If the post flip energy is higher then only keep the change with some small probability. (0.92 is an approximate placeholder for exp(-DeltaH)/exp(1) at the moment)
        e[2] = -1 * e[2]
The following suggestions won't boost your performance that much because they are not on the algorithmic level, i.e. not very specific to your problem. However, they are generic suggestions for slight performance improvements:
Unless you are using Python 3, change
for i in range(800000):
to
for i in xrange(800000):
The latter just iterates over the numbers from 0 up to 800000, while the former first creates a huge list of those numbers and then iterates over that list. Do something similar for the other loops that use range.
Also, change
j=random.choice(range(len(li)))
e=li[j] # Choose random edge
to
e = random.choice(li)
and use e instead of li[j] subsequently. If you really need an index number, use random.randint(0, len(li)-1).
There are syntactic changes you can make to speed things up, such as replacing your Sum and Prod functions with the built-in equivalents sum(x[3] for x in iterable) and reduce(operator.mul, iterable) - it is generally faster to use builtin functions or generator expressions than explicit loops.
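For instance, drop-in replacements for the question's prod and Sum might look like this (a sketch; an explicit start value of 1 keeps the empty-product behaviour of the original prod):
import operator
from functools import reduce  # reduce is a builtin on Python 2; this import also works there

def Sum(iterable):
    # Sum of the stored weight product (element 3) of each triad.
    return sum(x[3] for x in iterable)

def prod(iterable):
    # Product of all elements in the iterable.
    return reduce(operator.mul, iterable, 1)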
As far as I can tell the line:
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
is testing if a float is in a list of floats. Replacing it with if e[1] in a[i]: will remove the overhead of creating two set objects for each comparison.
Incidentally, you do not need to loop through the index values of an array, if you are only going to use that index to access the elements. e.g. replace
for i in range(0, len(a)):
    if set([e[1]]).issubset(a[i]):  # Keep triads which contain the whole edge (i.e. both nodes on the edge)
        TriNei.append(a[i])
with
for x in a:
    if set([e[1]]).issubset(x):  # Keep triads which contain the whole edge (i.e. both nodes on the edge)
        TriNei.append(x)
However I suspect that changes like this will not make a big difference to the overall runtime. To do that you either need to use a different algorithm or switch to a faster language. You could try running it in pypy - for some cases it can be significantly faster than CPython. You could also try cython, which will compile your code to C and can sometimes give a big performance gain especially if you annotate your code with cython type information. I think the biggest improvement may come from changing the algorithm to one that does less work, but I don't have any suggestions for that.
BTW, why loop 800000 times? What is the significance of that number?
Also, please use meaningful names for your variables. Using single character names or shrtAbbrv does not speed the code up at all, and makes it very hard to follow what it is doing.
There are quite a few things you can improve here. Start by profiling your program using a tool like cProfile. This will tell you where most of the program's time is being spent and thus where optimization is likely to be most helpful. As a hint, you don't need to generate all the triads at every iteration of the program.
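For example, assuming the simulation loop is wrapped in a main() function (a hypothetical name), a quick profiling run might look like:
import cProfile
import pstats

# Profile one run of the simulation and show the 20 most expensive calls
# sorted by cumulative time.
cProfile.run("main()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)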
You also need to fix your indentation before you can expect a decent answer.
Regardless, this question might be better suited to Code Review.
I'm not sure I understand exactly what you are aiming for, but there are at least two changes that might help. You probably don't need to destroy and create the graph every time in the loop since all you are doing is flipping one edge weight sign. And the computation to find the triangles can be improved.
Here is some code that generates a complete graph with random weights, picks a random edge in a loop, finds the triads and flips the edge weight...
import random
import networkx as nx

# complete graph with random 1/-1 as weight
G = nx.complete_graph(5)
for u, v, d in G.edges(data=True):
    d['weight'] = random.randrange(-1, 2, 2)  # -1 or 1

edges = G.edges()
for i in range(10):
    u, v = random.choice(edges)  # random edge
    nbrs = set(G[u]) & set(G[v]) - set([u, v])  # nodes in triads
    triads = [(u, v, n) for n in nbrs]
    print "triads", triads
    for u, v, w in triads:
        print (u, v, G[u][v]['weight']), (u, w, G[u][w]['weight']), (v, w, G[v][w]['weight'])
    G[u][v]['weight'] *= -1
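Building on that, here is a small sketch (the local_energy name is mine) of the "energy" contributed by the triangles around one edge, i.e. the quantity that plays the role of preH/postH in the question's loop, computed without rebuilding the graph:
def local_energy(G, u, v):
    # Sum of the three-edge weight products over every triangle containing (u, v);
    # negate it to get the preH/postH value used in the question's loop.
    common = set(G[u]) & set(G[v]) - set([u, v])
    return sum(G[u][v]['weight'] * G[u][w]['weight'] * G[v][w]['weight']
               for w in common)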