Constant time random selection and deletion - python

I'm trying to implement an edge list for a MultiGraph in Python.
What I've tried so far:
>>> from collections import Counter
>>> l1 = Counter({(1, 2): 2, (1, 3): 1})
>>> l2 = [(1, 2), (1, 2), (1, 3)]
l1 has constant-time deletion of all edges between two vertices (e.g. del l1[(1, 2)]) but linear-time random selection on those edges (e.g. random.choice(list(l1.elements()))). Note that you have to do a selection on elements (vs. l1 itself).
l2 has constant-time random selection (random.choice(l2)) but linear-time deletion of all elements equal to a given edge ([i for i in l2 if i != (1, 2)]).
Question: is there a Python data structure that would give me both constant-time random selection and deletion?

I don't think what you're trying to do is achievable in theory.
If you're using weighted values to represent duplicates, you can't get constant-time random selection. The best you could possibly do is some kind of skip-list-type structure that lets you binary-search the element by weighted index, which is logarithmic.
If you're not using weighted values to represent duplicates, then you need some structure that allows you to store multiple copies. And a hash table isn't going to do it—the dups have to be independent objects (e.g., (edge, autoincrement)), meaning there's no way to delete all that match some criterion in constant time.
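To make that concrete, here is the binary-search-by-weighted-index idea in its simplest, static form. This is only a sketch (Python 3): it shows the logarithmic selection step over a prefix-sum table; supporting deletions as well is exactly where you would need a skip list, Fenwick tree, or similar.
import bisect
import random
from collections import Counter
from itertools import accumulate

l1 = Counter({(1, 2): 2, (1, 3): 1})
edges = list(l1)                                      # distinct edges
cumulative = list(accumulate(l1[e] for e in edges))   # running totals, e.g. [2, 3]

def weighted_choice():
    r = random.randrange(cumulative[-1])              # weighted index in [0, total)
    return edges[bisect.bisect_right(cumulative, r)]  # O(log n) lookup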
If you can accept logarithmic time, the obvious choice is a tree. For example, using blist:
>>> import blist
>>> l3 = blist.sortedlist(l2)
To select one at random:
>>> edge = random.choice(l3)
The documentation doesn't seem to guarantee that this won't do something O(n). But fortunately, the source for both 3.3 and 2.7 shows that it's going to do the right thing. If you don't trust that, just write l3[random.randrange(len(l3))].
To delete all copies of an edge, you can do it like this:
>>> del l3[l3.bisect_left(edge):l3.bisect_right(edge)]
Or:
>>> try:
...     while True:
...         l3.remove(edge)
... except ValueError:
...     pass
The documentation explains the exact performance guarantees for every operation involved. In particular, len is constant, while indexing, slicing, deleting by index or slice, bisecting, and removing by value are all logarithmic, so both operations end up logarithmic.
(It's worth noting that blist is a B+Tree; you might get better performance out of a red-black tree, or a treap, or something else. You can find good implementations for most data structures on PyPI.)
As pointed out by senderle, if the maximum number of copies of an edge is much smaller than the size of the collection, you can create a data structure that does it in time quadratic on the maximum number of copies. Translating his suggestion into code:
import random
from collections import defaultdict

class MGraph(object):
    def __init__(self):
        self.edgelist = []
        self.edgedict = defaultdict(list)

    def add(self, edge):
        self.edgedict[edge].append(len(self.edgelist))
        self.edgelist.append(edge)

    def remove(self, edge):
        for index in self.edgedict.get(edge, []):
            maxedge = len(self.edgelist) - 1
            lastedge = self.edgelist[maxedge]
            self.edgelist[index], self.edgelist[maxedge] = self.edgelist[maxedge], self.edgelist[index]
            self.edgedict[lastedge] = [i if i != maxedge else index for i in self.edgedict[lastedge]]
            del self.edgelist[-1]
        del self.edgedict[edge]

    def choice(self):
        return random.choice(self.edgelist)
(You could, of course, replace the list-comprehension line with a three-line find-and-update-in-place, but that's still linear in the number of dups.)
Obviously, if you plan to use this for real, you may want to beef up the class a bit. You can make it look like a list of edges, a set of tuples of multiple copies of each edge, a Counter, etc., by implementing a few methods and letting the appropriate collections.abc.Foo/collections.Foo fill in the rest.
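For instance, a minimal sketch of the "list of edges" view (the wrapper class name here is made up): implement __len__ and __getitem__, inherit from collections.abc.Sequence, and the ABC supplies the rest.
from collections.abc import Sequence   # the ABCs live directly under collections on Python 2

class MGraphEdges(Sequence):
    """Read-only, list-like view of an MGraph's edges."""
    def __init__(self, mgraph):
        self._g = mgraph
    def __len__(self):
        return len(self._g.edgelist)
    def __getitem__(self, index):
        return self._g.edgelist[index]

# Sequence now fills in __contains__, __iter__, __reversed__, .index() and .count().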
So, which is better? Well, in your sample case, the average dup count is half the size of the list, and the maximum is 2/3rds the size. If that were true for your real data, the tree would be much, much better, because log N will obviously blow away (N/2)**2. On the other hand, if dups were rare, senderle's solution would obviously be better, because W**2 is still 1 if W is 1.
Of course for a 3-element sample, constant overhead and multipliers are going to dominate everything. But presumably your real collection isn't that tiny. (If it is, just use a list...)
If you don't know how to characterize your real data, write both implementations and time them with various realistic inputs.


How can I make this Python code quicker?

How can I make this code quicker?
def add_may_go(x, y):
    counter = 0
    for i in range(-2, 3):
        cur_y = y + i
        if cur_y < 0 or cur_y >= board_order:
            continue
        for j in range(-2, 3):
            cur_x = x + j
            if (i == 0 and j == 0) or cur_x < 0 or cur_x >= board_order or [cur_y, cur_x] in huge_may_go:
                continue
            if not public_grid[cur_y][cur_x]:
                huge_may_go.append([cur_y, cur_x])
                counter += 1
    return counter
INPUT:
something like: add_may_go(8,8), add_may_go(8,9) ...
huge_may_go is a huge list like:
[[7,8],[7,9], [8,8],[8,9],[8,10]....]
public_grid is also a huge list; its size is board_order*board_order
every element is either 0 or 1
like:
[
[0,1,0,1,0,1,1,...(board_order times), 0, 1],
... board_order times
[1,0,1,1,0,0,1,...(board_order times), 0, 1],
]
board_order is a global variable, usually 19 (sometimes 15 or 20)
It runs too slowly now. This function is going to run hundreds of times. Any suggestion is welcome!
I have tried numpy, but numpy made it even slower! Please help.
It is difficult to provide a definitive improvement without sample data and a bit more context. Using numpy would be beneficial if you can manage to perform all the calls (i.e. all (x,y) coordinate values) in a single operation. There are also strategies based on sets that could work but you would need to maintain additional data structures in parallel with the public_grid.
Based only on that piece of code, and without changing the rest of the program, there are a couple of things you could do that will provide small performance improvements:
loop only over eligible coordinates rather than skipping invalid ones (outside the board)
only compute the cur_x, cur_y values once (track them in a dictionary keyed by (x, y) pair). This assumes the same x, y coordinates are used in multiple calls to the function.
use comprehensions when possible
use set operations to avoid duplicate coordinates in huge_may_go
hugeCoord = dict()  # keep track of the offset coordinates

def add_may_go(x, y):
    # compute coordinates only once (the first time)
    if (x, y) not in hugeCoord:
        hugeCoord[x, y] = [(cx, cy)
                           for cy in range(max(0, y-2), min(board_order, y+3))
                           for cx in range(max(0, x-2), min(board_order, x+3))
                           if cx != x or cy != y]
    # get the resulting list of coordinates using a comprehension
    fit = {(cy, cx) for cx, cy in hugeCoord[x, y] if not public_grid[cy][cx]}
    fit.difference_update(huge_may_go)  # use set to avoid duplicates (assumes huge_may_go holds (y, x) tuples, not [y, x] lists)
    huge_may_go.extend(fit)
    return len(fit)
Note that if huge_may_go were a set instead of a list, adding to it without repetitions would be more efficient, because you could update it directly (and return the difference in size); see the sketch below.
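A sketch of that variant, assuming huge_may_go is a set of (y, x) tuples (same helper names as above):
def add_may_go(x, y):
    # same cached hugeCoord lookup as above
    if (x, y) not in hugeCoord:
        hugeCoord[x, y] = [(cx, cy)
                           for cy in range(max(0, y-2), min(board_order, y+3))
                           for cx in range(max(0, x-2), min(board_order, x+3))
                           if cx != x or cy != y]
    before = len(huge_may_go)
    huge_may_go.update((cy, cx) for cx, cy in hugeCoord[x, y]
                       if not public_grid[cy][cx])   # set.update ignores duplicates
    return len(huge_may_go) - before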
if (i == 0 and j == 0)...: continue
Small improvement; reduce the number of iterations by not making those.
for i in (1, 2):
    do stuff with i and -i
    for j in (1, 2):
        do stuff with j and -j
I want to highlight 2 places which need special attention.
if (...) [cur_y,cur_x] in huge_may_go:
Unlike the rest of the conditions, this is not an arithmetic comparison but a containment check; if huge_may_go is a list, it takes O(n) time, i.e. time proportional to the number of elements in the list.
huge_may_go.append([cur_y,cur_x])
The Python Wiki describes the list .append method as O(1), but with the disclaimer that individual actions may take surprisingly long, depending on the history of the container. You might use collections.deque as a replacement for list; it was designed with the performance of inserts (at either end) in mind.
If huge_may_go must not contain duplicates and you do not care about order, then you might use a set rather than a list and keep tuples of (y, x) in it (a set cannot hold lists). When using the .add method of a set you can skip the contains check, as adding an existing element has no effect; consider that
s = set()
s.add((1,2))
s.add((3,4))
s.add((1,2))
print(s)
gives output
{(1, 2), (3, 4)}
If you then need a contains check, a set contains check is O(1).

Given a set t of tuples containing elements from the set S, what is the most efficient way to build another set whose members are not contained in t?

For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
[7,3],
[3,1]]
d = some_algorithm(S,t)
/*
d =[[2,1],
[3,2],
[7,4]]
*/
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
my naive attempt would be a base-transformation function to reduce the problem to an integer set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
Since S is a set, we can write each element of t as an M-digit number in base L, using elements of S as digits.
For M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted, then look up the values in S and construct the tuple.
Since you can now represent t as an integer set (read: hash table), calculating the inverse set d becomes rather trivial:
loop from 0 to L^M - 1 and ask your hash table whether the element is in t; if not, it belongs to d.
If the size of S is too big, you can also just draw random numbers and check them against the hash table to get a subset of the inverse of t.
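A minimal sketch of that encoding for the example data in the question (M = 2; the index function I is just a dict lookup):
S = [0, 1, 2, 3, 7]
L = len(S)
I = {v: i for i, v in enumerate(S)}          # index function I(x)

def f(pair):                                 # tuple -> integer in [0, L**2)
    x0, x1 = pair
    return I[x1] * L + I[x0]

def f_inv(code):                             # integer -> tuple
    return (S[code % L], S[code // L])

t = [(0, 1), (7, 3), (3, 1)]
t_codes = {f(p) for p in t}                  # t as an integer hash set
d = [f_inv(c) for c in range(L**2) if c not in t_codes]   # the full inverse set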
does this help you?
If |t| + |d| << |S|^2 then the probability of some random tuple to be chosen again (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that by redrawing elements until you get a new one, you need O((1/(1-C)) * |d|) draws in total on average to build d, which is O(|d|) if C is indeed constant.
Checking if an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup is O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen; it will make some errors, saying an element has already been "seen" though it was not, but never the other way around, so you will still get all elements of d as unique.
In-place sorting of t, and using binary search. This adds O(|t| log |t|) pre-processing and O(log |t|) per lookup, but requires no additional space (other than where you store d).
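A minimal sketch of the redraw-until-new approach with a hash set (plain Python; S is a list of values and t an iterable of pairs, as in the question):
import random

def sample_not_in_t(S, t, m):
    # assumes len(t) + m is well below len(S)**2, as discussed above
    seen = {tuple(pair) for pair in t}          # hash set of forbidden tuples
    out = []
    while len(out) < m:
        cand = (random.choice(S), random.choice(S))
        if cand not in seen:
            seen.add(cand)                      # also keeps d itself duplicate-free
            out.append(cand)
    return out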
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to use a Fisher-Yates shuffle on the available choices, and choose the first |d| elements that do not appear in t.

adding edges to unweighted directed graphs efficiency?

I have three graphs represented as python dictionaries
A: {1: [2], 2: [1, 3], 3: []}
B: {1: {'neighbours': [2]}, 2: {'neighbours': [1, 3]}, 3: {'neighbours': []}}
C: {1: {2:None}, 2: {1:None, 3:None}, 3: {}}
I have a hasEdge and addEdge function
def addEdge(self, source, target):
    assert self.hasNode(source) and self.hasNode(target)
    if not self.hasEdge(source, target):
        self.graph[source][target] = None

def hasEdge(self, source, target):
    assert self.hasNode(source) and self.hasNode(target)
    return target in self.graph[source]
I am not sure which structure will be most efficient for each function. My immediate thought is that A will be the most efficient for adding an edge and C will be the most efficient for checking whether an edge exists.
A and B are classic adjacency lists. C is an adjacency list, but uses an O(1) structure instead of an O(N) structure for the list. But really, you should use D, the adjacency set.
In Python, a set membership test (x in s) is an average-case O(1) operation.
So we can do
graph = { 1: set([2]), 2: set([1, 3]), 3: set() }
Then our addEdge(u, v) is
graph[u].add(v)
graph[v].add(u)
and our hasEdge(u, v) is just
v in graph[u]
(u and v are used here because from is a reserved word in Python.)
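Put together, a small sketch of that adjacency-set version (undirected, mirroring the snippets above; the class name is illustrative):
class SetGraph(object):
    def __init__(self, nodes):
        self.graph = {n: set() for n in nodes}

    def hasNode(self, node):
        return node in self.graph

    def addEdge(self, source, target):
        assert self.hasNode(source) and self.hasNode(target)
        self.graph[source].add(target)
        self.graph[target].add(source)

    def hasEdge(self, source, target):
        assert self.hasNode(source) and self.hasNode(target)
        return target in self.graph[source]

g = SetGraph([1, 2, 3])
g.addEdge(1, 2)
g.addEdge(2, 3)
print(g.hasEdge(2, 1))   # True
print(g.hasEdge(1, 3))   # False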
C seems to be the most efficient to me, since you are doing lookups that are on average O(1). (Note that this is the average case, not the worst case.) With Adjacency Lists, you have worst case Linear Search.
For a sparse graph, you may wish to use Adjacency Lists (A), as they will take up less space. However, for a dense graph, option C should be the most efficient.
A and B will have very similar runtimes - asymptotically the same. Unless there is data besides neighbors that you wish to add to these nodes, I would choose A.
I am not familiar with python; however, for Java, option C can be improved by using a HashSet (set) which would reduce your space requirements. Runtime would be the same as using a HashMap, but sets do not store values - only keys, which is what you want for checking if there is an edge between two nodes.
So, to clarify:
For runtime, choose C. You will have average case O(1) edge adds. To improve C in order to consume less memory, use sets instead of maps, so you do not have to allocate space for values.
For memory, choose A if you have a sparse graph. You'll save a good amount of memory, and won't lose too much in terms of runtime. For reference, sparse is when nodes don't have too many neighbors; for example, when each node has about 2 neighbors in a graph with 20 nodes.

Find the 2nd highest element

In a given array how to find the 2nd, 3rd, 4th, or 5th values?
Also, if we use the max() function in Python, what is its order of complexity, i.e., the complexity associated with max()?
def nth_largest(li, n):
    li.remove(max(li))
    print max(li)  # will give me the second largest
    # how to make a general algorithm to find the 2nd, 3rd, 4th highest value
    # n is the element to be found below the highest value
I'd go for:
import heapq
res = heapq.nlargest(2, some_sequence)
print res[1] # to get 2nd largest
This is more efficient than sorting the entire list, then taking the first n many elements. See the heapq documentation for further info.
You could use sorted(set(element)):
>>> a = (0, 11, 100, 11, 33, 33, 55)
>>>
>>> sorted(set(a))[-1] # highest
100
>>> sorted(set(a))[-2] # second highest
55
>>>
as a function:
def nth_largest(li, n):
    return sorted(set(li))[-n]
test:
>>> a = (0, 11, 100, 11, 33, 33, 55)
>>> def nth_largest(li, n):
...     return sorted(set(li))[-n]
...
>>>
>>> nth_largest(a, 1)
100
>>> nth_largest(a, 2)
55
>>>
Note that here you only need to sort and remove duplicates once; if you are worried about performance, you could cache the result of sorted(set(li)).
If performance is a concern (e.g.: you intend to call this a lot), then you should absolutely keep the list sorted and de-duplicated at all times, and simply take the first, second, or nth element (which is O(1)).
Use the bisect module for this - it's faster than a "standard" sort.
insort lets you insert an element, and bisect will let you find whether you should be inserting at all (to avoid duplicates).
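A sketch of that approach, keeping a sorted, de-duplicated list as values arrive (the helper names are hypothetical):
import bisect

sorted_unique = []

def insert(value):
    i = bisect.bisect_left(sorted_unique, value)
    if i == len(sorted_unique) or sorted_unique[i] != value:   # avoid duplicates
        sorted_unique.insert(i, value)                         # what insort would do

def nth_largest(n):
    return sorted_unique[-n]     # O(1) once the list is kept sorted

for v in (0, 11, 100, 11, 33, 33, 55):
    insert(v)
print(nth_largest(2))   # 55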
If performance is not a concern, I'd suggest the simpler:
def nth_largest(li, n):
    return sorted(set(li))[-(n+1)]
If the reverse indexing looks ugly to you, you can do:
def nth_largest(li, n):
    return sorted(set(li), reverse=True)[n]
As for which method would have the lowest time complexity, this depends a lot on which types of queries you plan on making.
If you're planning on making queries into high indexes (e.g. 36th largest element in a list with 38 elements), your function nth_largest(li,n) will have close to O(n^2) time complexity since it will have to do max, which is O(n), several times. It will be similar to the Selection Sort algorithm except using max() instead of min().
On the other hand, if you are only making low index queries, then your function can be efficient as it will only apply the O(n) max function a few times and the time complexity will be close to O(n). However, building a max heap is possible in linear time O(n) and you would be better off just using that. After you go through the trouble of constructing a heap, all of your max() operations on the heap will be O(1) which could be a better long-term solution for you.
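For reference, a short sketch of the heap option (heapq is a min-heap, so values are negated; the data here is just an example):
import heapq

data = [0, 11, 100, 11, 33, 33, 55]
heap = [-x for x in data]
heapq.heapify(heap)              # builds the heap in O(n)
print(-heap[0])                  # peek at the max in O(1) -> 100
print(-heapq.heappop(heap))      # pop the max in O(log n) -> 100
print(-heap[0])                  # new max -> 55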
I believe the most scalable way (in terms of being able to query nth largest element for any n) is to sort the list with time complexity O(n log n) using the built-in sort function and then make O(1) queries from the sorted list. Of course, that's not the most memory-efficient method but in terms of time complexity it is very efficient.
If you do not mind using numpy (import numpy as np):
np.partition(numbers, -i)[-i]
gives you the ith largest element of the list with a guaranteed worst-case O(n) running time.
The partition(a, kth) method returns an array where the kth element is the same as it would be in a sorted array; all elements before it are smaller, and all elements after it are equal or greater.
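For example, with made-up data:
import numpy as np

numbers = np.array([0, 11, 100, 11, 33, 33, 55])
i = 2
print(np.partition(numbers, -i)[-i])   # 2nd largest -> 55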
How about:
sorted(li)[::-1][n]

Optimising model of social network evolution

I am writing a piece of code which models the evolution of a social network. The idea is that each person is assigned to a node and relationships between people (edges on the network) are given a weight of +1 or -1 depending on whether the relationship is friendly or unfriendly.
Using this simple model you can say that a triad of three people is either "balanced" or "unbalanced" depending on whether the product of the edges of the triad is positive or negative.
So finally what I am trying to do is implement an Ising-type model, i.e. random edges are flipped, and the new relationship is kept if the new network has more balanced triangles (a lower energy) than the network before the flip; if that is not the case, then the new relationship is only kept with a certain probability.
OK, so finally onto my question: I have written the following code; however, the dataset I have contains ~120k triads, and as a result it will take 4 days to run!
Could anyone offer any tips on how I might optimise the code?
Thanks.
# Importing required libraries
try:
    import matplotlib.pyplot as plt
except:
    raise
import networkx as nx
import csv
import random
import math

def prod(iterable):
    p = 1
    for n in iterable:
        p *= n
    return p

def Sum(iterable):
    p = 0
    for n in iterable:
        p += n[3]
    return p

def CalcTriads(n):
    firstgen = G.neighbors(n)
    Edges = []
    Triads = []
    for i in firstgen:
        Edges.append(G.edges(i))
    for i in xrange(len(Edges)):
        for j in range(len(Edges[i])):  # For node n go through the list of edges (j) for the neighboring nodes (i)
            if set([Edges[i][j][1]]).issubset(firstgen):  # If the second node on the edge is also a neighbor of n (it's in firstgen) then keep the edge.
                t = [n, Edges[i][j][0], Edges[i][j][1]]
                t.sort()
                Triads.append(t)  # Add found nodes to Triads.
    new_Triads = []  # Delete duplicate triads.
    for elem in Triads:
        if elem not in new_Triads:
            new_Triads.append(elem)
    Triads = new_Triads
    for i in xrange(len(Triads)):  # Go through the list of all Triads finding the weights of their edges using G[node1][node2]. Multiply the three weights and append the value to each triad.
        a = G[Triads[i][0]][Triads[i][1]].values()
        b = G[Triads[i][1]][Triads[i][2]].values()
        c = G[Triads[i][2]][Triads[i][0]].values()
        Q = prod(a + b + c)
        Triads[i].append(Q)
    return Triads

###### Import sorted edge data ######
li = []
with open('Sorted Data.csv', 'rU') as f:
    reader = csv.reader(f)
    for row in reader:
        li.append([float(row[0]), float(row[1]), float(row[2])])

G = nx.Graph()
G.add_weighted_edges_from(li)

for i in xrange(800000):
    e = random.choice(li)  # Choose random edge
    TriNei = []
    a = CalcTriads(e[0])  # Find triads of first node in the chosen edge
    for i in xrange(0, len(a)):
        if set([e[1]]).issubset(a[i]):  # Keep triads which contain the whole edge (i.e. both nodes on the edge)
            TriNei.append(a[i])
    preH = -Sum(TriNei)  # Save the "energy" of all the triads of which the edge is a member
    e[2] = -1 * e[2]  # Flip the weight of the random edge and create a new graph with the flipped edge
    G.clear()
    G.add_weighted_edges_from(li)
    TriNei = []
    a = CalcTriads(e[0])
    for i in xrange(0, len(a)):
        if set([e[1]]).issubset(a[i]):
            TriNei.append(a[i])
    postH = -Sum(TriNei)  # Calculate the post flip "energy".
    if postH < preH:  # If the post flip energy is lower than the pre flip energy keep the change
        continue
    elif random.random() < 0.92:  # If the post flip energy is higher then only keep the change with some small probability. (0.92 is an approximate placeholder for exp(-DeltaH)/exp(1) at the moment)
        e[2] = -1 * e[2]
The following suggestions won't boost your performance that much because they are not on the algorithmic level, i.e. not very specific to your problem. However, they are generic suggestions for slight performance improvements:
Unless you are using Python 3, change
for i in range(800000):
to
for i in xrange(800000):
The latter one just iterates numbers from 0 to 800000, the first one creates a huge list of numbers and then iterates that list. Do something similar for the other loops using range.
Also, change
j=random.choice(range(len(li)))
e=li[j] # Choose random edge
to
e = random.choice(li)
and use e instead of li[j] subsequently. If you really need an index number, use random.randint(0, len(li)-1).
There are syntactic changes you can make to speed things up, such as replacing your Sum and Prod functions with the built-in equivalents sum(x[3] for x in iterable) and reduce(operator.mul, iterable) - it is generally faster to use builtin functions or generator expressions than explicit loops.
As far as I can tell the line:
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
is testing if a float is in a list of floats. Replacing it with if e[1] in a[i]: will remove the overhead of creating two set objects for each comparison.
Incidentally, you do not need to loop through the index values of an array, if you are only going to use that index to access the elements. e.g. replace
for i in range(0, len(a)):
    if set([e[1]]).issubset(a[i]):  # Keep triads which contain the whole edge (i.e. both nodes on the edge)
        TriNei.append(a[i])
with
for x in a:
    if set([e[1]]).issubset(x):  # Keep triads which contain the whole edge (i.e. both nodes on the edge)
        TriNei.append(x)
However I suspect that changes like this will not make a big difference to the overall runtime. To do that you either need to use a different algorithm or switch to a faster language. You could try running it in pypy - for some cases it can be significantly faster than CPython. You could also try cython, which will compile your code to C and can sometimes give a big performance gain especially if you annotate your code with cython type information. I think the biggest improvement may come from changing the algorithm to one that does less work, but I don't have any suggestions for that.
BTW, why loop 800000 times? What is the significance of that number?
Also, please use meaningful names for your variables. Using single character names or shrtAbbrv does not speed the code up at all, and makes it very hard to follow what it is doing.
There are quite a few things you can improve here. Start by profiling your program using a tool like cProfile. This will tell you where most of the program's time is being spent and thus where optimization is likely to be most helpful. As a hint, you don't need to generate all the triads at every iteration of the program.
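For reference, a minimal way to run the profiler mentioned above (the script and function names are placeholders):
# From the command line:
#   python -m cProfile -s cumulative your_script.py
# or inside the script, assuming the main loop is wrapped in a main() function:
import cProfile
cProfile.run('main()', sort='cumulative')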
You also need to fix your indentation before you can expect a decent answer.
Regardless, this question might be better suited to Code Review.
I'm not sure I understand exactly what you are aiming for, but there are at least two changes that might help. You probably don't need to destroy and create the graph every time in the loop since all you are doing is flipping one edge weight sign. And the computation to find the triangles can be improved.
Here is some code that generates a complete graph with random weights, picks a random edge in a loop, finds the triads and flips the edge weight...
import random
import networkx as nx

# complete graph with random 1/-1 as weight
G = nx.complete_graph(5)
for u, v, d in G.edges(data=True):
    d['weight'] = random.randrange(-1, 2, 2)  # -1 or 1

edges = G.edges()
for i in range(10):
    u, v = random.choice(edges)  # random edge
    nbrs = set(G[u]) & set(G[v]) - set([u, v])  # nodes in triads
    triads = [(u, v, n) for n in nbrs]
    print "triads", triads
    for u, v, w in triads:
        print (u, v, G[u][v]['weight']), (u, w, G[u][w]['weight']), (v, w, G[v][w]['weight'])
    G[u][v]['weight'] *= -1
