I have three graphs represented as Python dictionaries:
A: {1:[2], 2:[1,3], 3:[]}.
B: {1: {'neighbours': [2]}, 2: {'neighbours': [1, 3]}, 3: {'neighbours': []}}
C: {1: {2:None}, 2: {1:None, 3:None}, 3: {}}
I have addEdge and hasEdge functions:
def addEdge(self, source, target):
    assert self.hasNode(source) and self.hasNode(target)
    if not self.hasEdge(source, target):
        self.graph[source][target] = None

def hasEdge(self, source, target):
    assert self.hasNode(source) and self.hasNode(target)
    return target in self.graph[source]
I am not sure which structure will be most efficient for each function. My immediate thought is that A will be the most efficient for adding an edge, and C will be the most efficient for checking whether an edge exists.
A and B are classic adjacency lists. C is also adjacency-list-shaped, but replaces the O(n) inner list with an O(1) lookup structure. But really, you should use D, the adjacency set.
In Python, checking x in s on a set is an O(1) operation on average.
So we can do
graph = { 1: set([2]), 2: set([1, 3]), 3: set() }
Then our addEdge(source, target) is (from is a reserved word in Python, so the arguments are named source and target here)
graph[source].add(target)
graph[target].add(source)
and our hasEdge(source, target) is just
target in graph[source]
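Putting the whole thing together, a minimal sketch of the adjacency-set version as a class, reusing the method names from the question (addNode and hasNode are assumed helpers that were not shown in the original):

```python
class Graph(object):
    def __init__(self):
        self.graph = {}  # node -> set of neighbour nodes

    def addNode(self, node):
        self.graph.setdefault(node, set())

    def hasNode(self, node):
        return node in self.graph

    def addEdge(self, source, target):
        assert self.hasNode(source) and self.hasNode(target)
        self.graph[source].add(target)   # average O(1)
        self.graph[target].add(source)   # undirected, as in the answer above

    def hasEdge(self, source, target):
        assert self.hasNode(source) and self.hasNode(target)
        return target in self.graph[source]  # average O(1)
```

set.add is idempotent, so the "if not self.hasEdge(...)" guard from the question becomes unnecessary.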
C seems to be the most efficient to me, since you are doing lookups that are O(1) on average. (Note that this is the average case, not the worst case.) With adjacency lists, you have a worst-case linear search.
For a sparse graph, you may wish to use Adjacency Lists (A), as they will take up less space. However, for a dense graph, option C should be the most efficient.
A and B will have very similar runtimes - asymptotically the same. Unless there is data besides neighbors that you wish to add to these nodes, I would choose A.
I am not familiar with Python; however, in Java, option C can be improved by using a HashSet (a set), which would reduce your space requirements. Runtime would be the same as using a HashMap, but sets store only keys, no values, which is all you need for checking whether there is an edge between two nodes.
So, to clarify:
For runtime, choose C. You will have average case O(1) edge adds. To improve C in order to consume less memory, use sets instead of maps, so you do not have to allocate space for values.
For memory, choose A if you have a sparse graph. You'll save a good amount of memory, and won't lose too much in terms of runtime. For reference, sparse is when nodes don't have too many neighbors; for example, when each node has about 2 neighbors in a graph with 20 nodes.
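If you want to sanity-check the memory argument, here is a rough comparison of A's inner lists against C's inner dicts on a small path graph; sys.getsizeof figures are CPython-specific and count only the containers, not the node objects:

```python
import sys

# a path graph with n nodes, stored as option A (lists) and option C (dicts)
n = 1000
list_graph = {i: [i - 1, i + 1] for i in range(n)}               # A
dict_graph = {i: {i - 1: None, i + 1: None} for i in range(n)}   # C

# total bytes used by the inner containers only
list_bytes = sum(sys.getsizeof(v) for v in list_graph.values())
dict_bytes = sum(sys.getsizeof(v) for v in dict_graph.values())
print(list_bytes, dict_bytes)
```

On CPython the two-entry dicts come out noticeably larger than the two-entry lists, which is the sparse-graph trade-off described above.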
For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0, 1, 2, 3, 7]
t = [[0, 1],
     [7, 3],
     [3, 1]]
d = some_algorithm(S, t)
# d = [[2, 1],
#      [3, 2],
#      [7, 0]]
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i.tolist() not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
My naive attempt would be a base-transformation function to reduce the problem to an integer-set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
since S is a set, we can write each element of t as an M-digit number in base L, using the elements of S as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted, then look up the values in S and construct the tuple.
since now you can represent t as an integer set (read: hash table), calculating the inverse set d becomes rather trivial
loop from 0 to L^M - 1 and ask your hash table whether the element is in t or d
if the size of S is too big, you can also just draw random numbers and test them against the hash table, to sample a subset of the inverse of t
does this help you?
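A rough sketch of the transform above for M = 2, with the index function I(x) realised as a dict (make_transform is just an illustrative name):

```python
def make_transform(S):
    # base-transform for M = 2: f encodes a 2-tuple over S as an integer in
    # base L, f_inv decodes it again
    S = list(S)
    L = len(S)
    index = {x: i for i, x in enumerate(S)}          # I(x)

    def f(pair):
        return index[pair[1]] * L + index[pair[0]]   # I(x[1])*L^1 + I(x[0])*L^0

    def f_inv(n):
        # least significant digit first: n mod L, then floor(n / L)
        return (S[n % L], S[n // L])

    return f, f_inv
```

Since f_inv(f(x)) == x for any 2-tuple over S, the set t can be stored as a plain set of integers.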
If |t| + |d| << |S|^2 then the probability of some random tuple being drawn again (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means, that by doing this, and redrawing elements until this is a new element, you get O((1/(1-C)) * |d|) times to process a new element (on average), which is O(|d|) if C is indeed constant.
Checking if an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup is constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen; this will make some errors, saying an element is already "seen" though it was not, but never the other way around, so you will still get all elements of d unique.
Sorting t in place and using binary search. This adds O(|t|log|t|) pre-processing and O(log|t|) per lookup, but requires no additional space (other than where you store d).
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to run a Fisher-Yates shuffle on the available choices and choose the first |d| elements that do not appear in t.
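A sketch of the redraw idea with a hash set of t (sample_new_tuples is an illustrative name; it assumes |t| + m stays well below |S|^2, so the expected number of redraws per element stays constant):

```python
import numpy as np

def sample_new_tuples(S, t, m, seed=0):
    # redraw random pairs until m "new" ones are found; membership tests
    # against the hash set are O(1) on average
    rng = np.random.default_rng(seed)
    seen = {tuple(int(v) for v in row) for row in t}  # hash set of t
    S = np.asarray(S)
    out = []
    while len(out) < m:
        pair = tuple(int(v) for v in rng.choice(S, 2))  # redraw until unseen
        if pair not in seen:
            seen.add(pair)   # also keeps d itself duplicate-free
            out.append(pair)
    return np.array(out)
```

Unlike the naive filter-after-sampling approach from the question, this always returns exactly m rows.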
Here is the situation:
I have a graph-type structure, an adjacency list, and each element of this adjacency list is a 1-dimensional array (either numpy or bcolz; not sure if I will use bcolz yet).
Each 1-dimensional array represents graph elements that could possibly connect, in the form of binary sequences. For them to connect, they need to have a specific bitwise intersection value.
Therefore, for each 1 dimensional array in my adjacency list, I want to do the bitwise "and" between every combination of two elements in the given array.
This will possibly be used for huge graph breadth-first traversal, so we may be talking a very very large number of elements.
Is this something I can do with vectorized operations? Should I be using a different structure? What is a good way to do this? I am willing to completely restructure everything if there could be a significant performance boost.
Is it as simple as looping through the individual elements and then broadcasting (correct terminology?) & against the entire array? Thanks.
quick edit
As an extra note, I am using Python integers for my byte sequences, which from my understanding doesn't play well with numpy (the integers get too big for fixed-width types like int64), so I have to create arrays of object type. Does this potentially cause a huge slowdown? Is it a reason to use a different structure?
An Example
# create an n x n adjacency list, where n is the number of graph nodes.
# map each graph node to a value 2^k:
nodevals = defaultdict()
for i in xrange(n):
    nodevals[i] = 2**(i+1)

# each edge in our graph is comprised of two nodes, which are mapped as powers
# of two. Take their sum, and place them in the adjacency list:
for i in xrange(n):
    for j in xrange(n):
        adjlist[i][j].append(nodevals[i] | nodevals[j])

# We now have our first adjacency list, which is just bare edges. These edges
# can be connected by row or column, by taking the intersection of the sum
# (nodevals[i] | nodevals[j]) with the other edge (nodevals[i2] | nodevals[j2]),
# and checking if it equals the connection point for each.
# This may not seem useful for individual edges, but in future iterations we can do this:
# After 3 iterations: (5,1) connected to (1,9), and then this connected to (7,5), for example:
adjlist[5][1] & adjlist[1][9] == 1
adjlist2[5][9] == adjlist[5][1] | adjlist[1][9]
adjlist[7][5] & adjlist2[5][9] == 5
adjlist3[7][9] == adjlist[7][5] | adjlist2[5][9]
# So, you may see how this could be useful for efficient traversal.
# However, it becomes more complicated, because as we increase the length of
# our subpaths, or "pseudo-edges", or whatever you want to call them,
# the arrays for the given (i,j) hold more and more subpath sums that can
# potentially be connected.
# Soon the arrays can become very large, which is when I would want to be able
# to efficiently calculate intersections.
# AND, for this problem in particular, I want to be able to connect edges
# against the SAME edge, so I want to do the bitwise intersection between all
# pairs of elements in the given array (i.e., the given indices [i][j] of that
# adjacency list).
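On the broadcasting question: for fixed-width integer dtypes, all pairwise ANDs of an array are indeed a single broadcast expression, as sketched below with a tiny made-up array. For object-dtype arrays of arbitrarily large Python ints, NumPy falls back to per-element Python calls, so most of the vectorization benefit disappears:

```python
import numpy as np

# tiny hypothetical array of bit masks; real data would be much larger
arr = np.array([0b0110, 0b0011, 0b1100], dtype=np.uint64)

# broadcasting a column against a row gives every pairwise intersection:
# pairwise[i, j] == arr[i] & arr[j]
pairwise = arr[:, None] & arr[None, :]
```

This computes an n x n matrix, so for very large arrays you may still want to process one row at a time (arr[i] & arr) to bound memory.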
I'm trying to implement an edge list for a MultiGraph in Python.
What I've tried so far:
>>> l1 = Counter({(1, 2): 2, (1, 3): 1})
>>> l2 = [(1, 2), (1, 2), (1, 3)]
l1 has constant-time deletion of all edges between two vertices (e.g. del l1[(1, 2)]) but linear-time random selection on those edges (e.g. random.choice(list(l1.elements()))). Note that you have to do a selection on elements (vs. l1 itself).
l2 has constant-time random selection (random.choice(l2)) but linear-time deletion of all elements equal to a given edge ([i for i in l2 if i != (1, 2)]).
Question: is there a Python data structure that would give me both constant-time random selection and deletion?
I don't think what you're trying to do is achievable in theory.
If you're using weighted values to represent duplicates, you can't get constant-time random selection. The best you could possibly do is some kind of skip-list-type structure that lets you binary-search the element by weighted index, which is logarithmic.
If you're not using weighted values to represent duplicates, then you need some structure that allows you to store multiple copies. And a hash table isn't going to do it: the dups have to be independent objects (e.g., (edge, autoincrement)), meaning there's no way to delete all that match some criterion in constant time.
If you can accept logarithmic time, the obvious choice is a tree. For example, using blist:
>>> l3 = blist.sortedlist(l2)
To select one at random:
>>> edge = random.choice(l3)
The documentation doesn't seem to guarantee that this won't do something O(n). But fortunately, the source for both 3.3 and 2.7 shows that it's going to do the right thing. If you don't trust that, just write l3[random.randrange(len(l3))].
To delete all copies of an edge, you can do it like this:
>>> del l3[l3.bisect_left(edge):l3.bisect_right(edge)]
Or:
>>> try:
... while True:
... l3.remove(edge)
... except ValueError:
... pass
The documentation explains the exact performance guarantees for every operation involved. In particular, len is constant, while indexing, slicing, deleting by index or slice, bisecting, and removing by value are all logarithmic, so both operations end up logarithmic.
(It's worth noting that blist is a B+Tree; you might get better performance out of a red-black tree, or a treap, or something else. You can find good implementations for most data structures on PyPI.)
As pointed out by senderle, if the maximum number of copies of an edge is much smaller than the size of the collection, you can create a data structure that does it in time quadratic on the maximum number of copies. Translating his suggestion into code:
from collections import defaultdict
import random

class MGraph(object):
    def __init__(self):
        self.edgelist = []
        self.edgedict = defaultdict(list)

    def add(self, edge):
        self.edgedict[edge].append(len(self.edgelist))
        self.edgelist.append(edge)

    def remove(self, edge):
        # walk the indices from highest to lowest, so the swap-with-last trick
        # can't move a not-yet-processed copy out from under us
        for index in sorted(self.edgedict.get(edge, []), reverse=True):
            maxedge = len(self.edgelist) - 1
            lastedge = self.edgelist[maxedge]
            self.edgelist[index], self.edgelist[maxedge] = self.edgelist[maxedge], self.edgelist[index]
            self.edgedict[lastedge] = [i if i != maxedge else index for i in self.edgedict[lastedge]]
            del self.edgelist[-1]
        del self.edgedict[edge]

    def choice(self):
        return random.choice(self.edgelist)
(You could, of course, replace the list-comprehension line with a three-line find-and-update-in-place, but that's still linear in the number of dups.)
Obviously, if you plan to use this for real, you may want to beef up the class a bit. You can make it look like a list of edges, a set of tuples of multiple copies of each edge, a Counter, etc., by implementing a few methods and letting the appropriate collections.abc.Foo/collections.Foo fill in the rest.
So, which is better? Well, in your sample case, the average dup count is half the size of the list, and the maximum is 2/3rds the size. If that were true for your real data, the tree would be much, much better, because log N will obviously blow away (N/2)**2. On the other hand, if dups were rare, senderle's solution would obviously be better, because W**2 is still 1 if W is 1.
Of course for a 3-element sample, constant overhead and multipliers are going to dominate everything. But presumably your real collection isn't that tiny. (If it is, just use a list...)
If you don't know how to characterize your real data, write both implementations and time them with various realistic inputs.
I have a (un-directed) graph represented using adjacency lists, e.g.
a: b, c, e
b: a, d
c: a, d
d: b, c
e: a
where each node of the graph is linked to a list of other node(s)
I want to update such a graph given some new list(s) for certain node(s), e.g.
a: b, c, d
where a is no longer connected to e, and is connected to a new node d
What would be an efficient (both time and space wise) algorithm for performing such updates to the graph?
Maybe I'm missing something, but wouldn't it be fastest to use a dictionary (or default dict) of node-labels (strings or numbers) to sets? In this case update could look something like this:
def update(graph, node, edges, undirected=True):
    # graph: dict(str->set(str)), node: str, edges: set(str), undirected: bool
    if undirected:
        for e in graph[node]:
            graph[e].remove(node)
        for e in edges:
            graph[e].add(node)
    graph[node] = edges
Using sets and dicts, adding and removing the node to/from the edge sets of the other nodes is O(1) on average, as is updating the edge set for the node itself, so the whole update is O(n) across the two loops, with n being the average number of edges of a node.
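For example, applying the update from the question (a: b, c, e becomes a: b, c, d); the function is restated here so the snippet runs on its own:

```python
def update(graph, node, edges, undirected=True):
    # graph: dict(str->set(str)), node: str, edges: set(str), undirected: bool
    if undirected:
        for e in graph[node]:
            graph[e].remove(node)  # unlink node from its old neighbours
        for e in edges:
            graph[e].add(node)     # link node into its new neighbours
    graph[node] = edges

graph = {'a': {'b', 'c', 'e'}, 'b': {'a', 'd'}, 'c': {'a', 'd'},
         'd': {'b', 'c'}, 'e': {'a'}}
update(graph, 'a', {'b', 'c', 'd'})
```

Afterwards e's set no longer contains a, and d's set does, so both sides of every edge stay consistent.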
Using an adjacency grid would make it O(n) to update, but would take n^2 space, regardless of how sparse the graph is. (Trivially done by updating each changed relationship by inverting the row and column.)
Using lists would put the time up to O(n^2) for updating, but for sparse graphs would not take a huge time penalty, and would save a lot of space.
A typical update is del edge a,e; add edge a,d, but your update looks like a new adjacency list for vertex a. So simply find the a adjacency list and replace it. That should be O(log n) time (assuming sorted array of adjacency lists, like in your description).
Given a list of n comparable elements (say numbers or string), the optimal algorithm to find the ith ordered element takes O(n) time.
Does Python implement natively O(n) time order statistics for lists, dicts, sets, ...?
None of the Python data structures mentioned implements the i-th order statistic natively.
In fact, it might not make much sense for dictionaries and sets, given the fact that both make no assumptions about the ordering of its elements. For lists, it shouldn't be hard to implement the selection algorithm, which provides O(n) running time.
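For lists, such a selection routine can be sketched as a randomized quickselect with expected O(n) running time (illustrative code, not a library function):

```python
import random

def quickselect(a, k):
    # k-th smallest element of a (0-indexed), expected O(n):
    # partition around a random pivot and recurse into one side only
    pivot = random.choice(a)
    lt = [x for x in a if x < pivot]
    eq = [x for x in a if x == pivot]
    gt = [x for x in a if x > pivot]
    if k < len(lt):
        return quickselect(lt, k)
    if k < len(lt) + len(eq):
        return pivot
    return quickselect(gt, k - len(lt) - len(eq))
```

The worst case is O(n^2), but the random pivot makes that vanishingly unlikely; median-of-medians pivoting would give a guaranteed O(n).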
This is not a native solution, but you can use NumPy's partition to find the k-th order statistic of a list in O(n) time.
import numpy as np
x = [2, 4, 0, 3, 1]
k = 2
print('The k-th order statistic is:', np.partition(np.asarray(x), k)[k])
EDIT: this assumes zero-indexing, i.e. the "zeroth order statistic" above is 0.
If i << n, you can have a look at http://docs.python.org/library/heapq.html#heapq.nlargest and http://docs.python.org/library/heapq.html#heapq.nsmallest (they don't solve your problem exactly, but are faster than sorting and taking the i-th element).
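For instance, picking the i-th smallest element with a bounded heap costs roughly O(n log i) rather than the O(n log n) of a full sort:

```python
import heapq

x = [9, 1, 8, 2, 7, 3]
i = 2  # 0-indexed: the third-smallest element
# nsmallest returns the i+1 smallest elements in sorted order;
# the last of those is the i-th order statistic
ith_smallest = heapq.nsmallest(i + 1, x)[-1]
print(ith_smallest)  # prints 3
```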