I have a large set of From/To pairs that represent a hierarchy of connected nodes. As an example, the hierarchy:
      4 -- 5 -- 8
     /
    2 --- 6 - 9 -- 10
   /           \
  1             11
   \
    3 ---- 7
is encapsulated as:
{(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
I'd like to be able to create a function that returns all nodes upstream of a given node, e.g.:
nodes[2].us
> [4, 5, 6, 8, 9, 10, 11]
My actual set of nodes is in the tens of thousands, so I'd like to be able to very quickly return a list of all upstream nodes without having to perform recursion over the entire set each time I want to get an upstream set.
This is my best attempt so far, but it doesn't get beyond two levels up.
class Node:
    def __init__(self, fr, to):
        self.fr = fr
        self.to = to
        self.us = set()

def build_hierarchy(nodes):
    for node in nodes.values():
        if node.to in nodes:
            nodes[node.to].us.add(node)
    for node in nodes.values():
        for us_node in node.us.copy():
            node.us |= us_node.us
    return nodes
from_to = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3), (1, 0)}
nodes = {fr: Node(fr, to) for fr, to in from_to} # node objects indexed by "from"
nodes = build_hierarchy(nodes)
print [node.fr for node in nodes[2].us]
> [4, 6, 5, 9]
I'll show two ways of doing this. First, we'll simply modify your us attribute to intelligently compute and cache the results of a descendant lookup. Second, we'll use a graph library, networkx.
I'd really recommend you go with the graph library if your data naturally has graph structure. You'll save yourself a lot of hassle that way.
Caching us nodes property
You can make your us attribute a property, and cache the results of previous lookups:
class Node(object):
    def __init__(self):
        self.name = None
        self.parent = None
        self.children = set()
        self._upstream = set()

    def __repr__(self):
        return "Node({})".format(self.name)

    @property
    def upstream(self):
        if self._upstream:
            return self._upstream
        else:
            for child in self.children:
                self._upstream.add(child)
                self._upstream |= child.upstream
            return self._upstream
Note that I'm using a slightly different representation than you. I'll create the graph:
import collections

edges = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
nodes = collections.defaultdict(lambda: Node())

for node, parent in edges:
    nodes[node].name = node
    nodes[parent].name = parent
    nodes[node].parent = nodes[parent]
    nodes[parent].children.add(nodes[node])
and I'll lookup the upstream nodes for node 2:
>>> nodes[2].upstream
{Node(5), Node(4), Node(11), Node(9), Node(6), Node(8), Node(10)}
Once the nodes upstream of 2 are computed, they won't be recomputed if you call, for example, nodes[1].upstream. However, if you then make any changes to your graph, the cached upstream nodes will be incorrect.
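If the graph does change, one simple option (my addition, not part of the answer's caching scheme) is to clear every node's cached set and let the property rebuild it lazily on the next lookup:

def invalidate_upstream_caches(nodes):
    # hypothetical helper: drop all cached upstream sets after a mutation
    for node in nodes.values():
        node._upstream = set()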
Using networkx
If we use networkx to represent our graph, a lookup of all of the descendants of a node is very simple:
>>> import networkx as nx
>>> from_to = [(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2),
...            (2, 1), (3, 1), (7, 3), (1, 0)]
>>> graph = nx.DiGraph(from_to).reverse()
>>> nx.descendants(graph, 2)
{4, 5, 6, 8, 9, 10, 11}
This doesn't fully answer your question, which seemed to be about optimizing the lookup of descendants so work wasn't repeated on subsequent calls. However, for all we know, networkx.descendants might do some intelligent caching.
So this is what I'd suggest: avoid optimizing prematurely and use the libraries. If networkx.descendants is too slow, then you might investigate the networkx code to see if it caches lookups. If not, you can build your own caching lookup using more primitive networkx functions. My bet is that networkx.descendants will work just fine, and you won't need to go through the extra work.
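If you do end up needing caching on top of networkx, here's a minimal sketch of what I mean (my own assumption of how you might wire it up; it reuses the reversed graph from above and is only valid while the graph isn't mutated):

from functools import lru_cache
import networkx as nx

graph = nx.DiGraph(from_to).reverse()

@lru_cache(maxsize=None)
def cached_descendants(node):
    # frozenset so callers can't mutate the cached value
    return frozenset(nx.descendants(graph, node))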
Here's a function that will calculate the entire upstream list for a single node:
def upstream_nodes(start_node):
    result = []
    current = start_node
    while current.to:  # current.to == 0 means we're at the root node
        result.append(current.to)
        current = nodes[current.to]
    return result
You've said that you don't want to iterate over the entire set of nodes each time you query an upstream set, but this won't: it will just query that node's parent, and its parent, and so on up to the root. So if the node is four levels down, it will make four dictionary lookups.
Or, if you want to be really clever, here's a version that will only make each parent lookup once, then store that lookup in the Node object's .us attribute so you never have to calculate the value again. (If your nodes' parent links aren't going to change after the graph has been created, this will work -- if you change your graph, of course, it won't).
def caching_upstream_nodes(start_node, nodes):
    # start_node is the Node object whose upstream set you want
    # nodes is the dictionary you created mapping ints to Node objects
    if start_node.us:
        # We already calculated this once, no need to re-calculate
        return start_node.us
    parent = nodes.get(start_node.to)
    if parent is None:
        # We're at the root node
        start_node.us = set()
        return start_node.us
    # Otherwise, our upstream is our parent's upstream, plus the parent
    parent_upstream = caching_upstream_nodes(parent, nodes)
    start_node.us = parent_upstream.copy()
    start_node.us.add(start_node.to)
    return start_node.us
One of those two functions should be what you're looking for. (NOTE: Exercise a little bit of caution when running these, as I just wrote them but haven't invested the time to test them. I believe the algorithm is correct, but there's always a chance that I made a basic error in writing it.)
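For instance, with a fresh nodes dict built from the question's from_to set (and without running build_hierarchy first, since that also writes to .us), I'd expect:

nodes = {fr: Node(fr, to) for fr, to in from_to}
print upstream_nodes(nodes[8])                 # -> [5, 4, 2, 1]
print caching_upstream_nodes(nodes[8], nodes)  # -> set([1, 2, 4, 5]), unordered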
Related
Given a list of tuples that represent edges:
edges = [(2, 4), (3, 4), (6, 8), (6, 9), (7, 10), (11, 13)]
I want to merge or blend those edges to get a list of merged tuples, for example (2, 4), (3, 4) will be merged into (2, 4).
The final output of the list above should look like:
[(2, 4), (6, 10), (11, 13)]
My idea is to use a double for loop to iterate over the list, find intersections, and substitute the two edges with (min(e1[0], e2[0]), max(e1[1], e2[1])), but this method won't work for all cases.
Any good thoughts?
Here's my solution:
edges = [(2, 4), (3, 4), (6, 8), (6, 9), (7, 10), (11, 13)]
edges = sorted(edges, key=lambda x: (x[0], -x[1]))
fused = []
i = 0
while i < len(edges):
    start, end = edges[i]
    j = i + 1  # index of the first edge that does NOT belong to this fused range
    while j < len(edges) and edges[j][0] <= end:
        # edges[j] is included in the fused range
        # Update end to the greater value
        end = max(edges[j][1], end)
        j += 1
    fused.append((start, end))
    del edges[i:j]
print(fused)
Explanation:
The logic works as follows: we sort the list in ascending order of the start values. If two ranges have the same start value, we arrange them in descending order of their end elements. This way two ranges with the same start value will be 'eaten up' by the range with the farther end value.
Now that the list is sorted in this unique way, there's a nice little property: if you start from the first range, you can decide for each following range whether or not to fuse with it. If you do fuse, update the end of the current range to absorb the 'fusable' range. If you decide NOT to fuse, then everything fused so far gets appended to the new list as a single range.
edges = sorted(edges, key=lambda x:(x[0], -x[1]))
Sorts edges in ascending order of the start values and descending order of end values.
del edges[i:j]
Deletes all the fused ranges from the original list. This is important because i always points to the new range that we'll start fusing from.
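For comparison, here is an equivalent single-pass merge that leaves the input list untouched; this is my own sketch of the same sort-and-sweep idea, not part of the answer above:

def merge_edges(edges):
    merged = []
    for start, end in sorted(edges):  # ascending start is enough here
        if merged and start <= merged[-1][1]:
            # overlaps the last merged range, so extend it if needed
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_edges([(2, 4), (3, 4), (6, 8), (6, 9), (7, 10), (11, 13)]))
# [(2, 4), (6, 10), (11, 13)]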
I have a circle-growth algorithm (line-growth with closed links) where new points are added between existing points at each iteration.
The linkage information of each point is stored as a tuple in a list. That list is updated iteratively.
QUESTIONS:
What would be the most efficient way to return the spatial order of these points as a list?
Do I need to compute the whole order at each iteration, or is there a way to cumulatively insert the new points in an orderly manner into that list?
All I could come up with is the following:
tuples = [(1, 4), (2, 5), (3, 6), (1, 6), (0, 7), (3, 7), (0, 8), (2, 8), (5, 9), (4, 9)]

starting_tuple = [e for e in tuples if e[0] == 0 or e[1] == 0][0]
## note: 'starting_tuple' could be either (0, 7) or (0, 8), starting direction doesn't matter

order = list(starting_tuple) if starting_tuple[0] == 0 else [starting_tuple[1], starting_tuple[0]]
## order will always start from point 0

idx = tuples.index(starting_tuple)
## index of the starting tuple

def findNext():
    global idx
    for i, e in enumerate(tuples):
        if order[-1] in e and i != idx:
            ind = e.index(order[-1])
            c = 0 if ind == 1 else 1
            order.append(e[c])
            idx = tuples.index(e)

for i in range(len(tuples) / 2):
    findNext()

print order
It is working, but it is neither elegant (non-pythonic) nor efficient.
It seems to me that a recursive algorithm may be more suitable, but unfortunately I don't know how to implement such a solution.
Also, please note that I'm using Python 2 and only have access to standard Python packages (no numpy).
Rather than recursion, this seems more like a dictionary and generator problem to me:
from collections import defaultdict

def findNext(tuples):
    previous = 0
    yield previous  # our first result
    dictionary = defaultdict(list)
    # [(1, 4), (2, 5), (3, 6), ...] -> {0: [7, 8], 1: [4, 6], 2: [5, 8], ...}
    for a, b in tuples:
        dictionary[a].append(b)
        dictionary[b].append(a)
    current = dictionary[0][0]  # dictionary[0][1] should also work
    yield current  # our second result
    while True:
        a, b = dictionary[current]  # possible connections
        following = a if a != previous else b  # only one will move us forward
        if following == 0:  # have we come full circle?
            break
        yield following  # our next result
        previous, current = current, following  # reset for next iteration

tuples = [(1, 4), (2, 5), (3, 6), (1, 6), (7, 0), (3, 7), (8, 0), (2, 8), (5, 9), (4, 9)]

generator = findNext(tuples)

for n in generator:
    print n
OUTPUT
% python test.py
0
7
3
6
1
4
9
5
2
8
%
Algorithm currently assumes we have more than two nodes.
Since the nodes only link to two other nodes, you can bin them by number, then follow the numbers around. This is O(n) sorting, which is pretty solid, but it's not a true sort in the <,>,= sense.
def bin_nodes(node_list):
    # figure out the in and out nodes for each node, and put those into a dictionary
    node_bins = {}  # init the bins
    for node_pair in node_list:  # go once through the list
        for i in range(len(node_pair)):  # put each node into the other's bin
            if node_pair[i] not in node_bins:  # initialize the bin dictionary for unseen nodes
                node_bins[node_pair[i]] = []
            node_bins[node_pair[i]].append(node_pair[(i + 1) % 2])
    return node_bins

def sort_bins(node_bins):
    # go from bin to bin, following the numbers
    nodes = [0] * len(node_bins)  # allocate a list
    nodes[0] = next(iter(node_bins))  # pick an arbitrary one to start
    nodes[1] = node_bins[nodes[0]][0]  # pick a direction to go
    for i in range(2, len(node_bins)):
        # one of the two nodes in the bin is the horse we rode in on;
        # the other is the next stop
        j = 1 if node_bins[nodes[i - 1]][0] == nodes[i - 2] else 0  # figure out which one ISN'T the one we came in on
        nodes[i] = node_bins[nodes[i - 1]][j]  # pick the next node, then go to its bin, rinse, repeat
    return nodes

if __name__ == "__main__":
    # test
    test = [(1, 2), (3, 4), (2, 4), (1, 3)]  # should give 1,3,4,2 or some rotation or reversal thereof
    print(bin_nodes(test))
    print(sort_bins(bin_nodes(test)))
I'm planning to read millions of small files from disk. To minimize i/o, I planned to use a dictionary that maps a file path to its content. I only want the dictionary to retain the last n keys inserted into it, though (so the dictionary will act as a cache).
Is there a data structure in Python that already implements this behavior? I wanted to check before reinventing the wheel.
Use collections.deque for this with a maxlen of 6 (your n), so that it stores only the last 6 elements, and store the information as key-value pairs:
from collections import deque
d = deque(maxlen=6)
d.extend([(1,1),(2,2),(3,3),(4,4), (5,5), (6,6)])
d
# deque([(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)], maxlen=6)
d.extend([(7,7)])
d
# deque([(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7)], maxlen=6)
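One caveat worth adding: a deque of pairs doesn't give you dict-style O(1) access, so looking up a key means a linear scan, along these lines (my sketch, not part of the answer above):

def lookup(cache, key):
    # scan the (key, value) pairs held by the deque
    for k, v in cache:
        if k == key:
            return v
    raise KeyError(key)

lookup(d, 3)  # -> 3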
For my particular problem, since I needed to read files from disk, I think I'll use the lru_cache as @PatrickHaugh suggested. Here's one way to use the cache:
from functools import lru_cache

@lru_cache(maxsize=10)
def read_file(file_path):
    print(' * reading', file_path)
    return file_path  # update to return the read file

for i in range(100):
    if i % 2 == 0:
        i = 0  # test that requests for 0 don't require additional i/o
    print(' * value of', i, 'is', read_file(i))
The output shows that requests for 0 do not incur additional i/o, which is perfect.
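functools also exposes per-function cache statistics via cache_info(), which should confirm this directly; the numbers below are what I'd expect for the loop above (49 hits for the repeated 0, 51 misses):

print(read_file.cache_info())
# expected: CacheInfo(hits=49, misses=51, maxsize=10, currsize=10)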
You can use collections.OrderedDict and its method popitem to ensure you keep only the last n keys added to the dictionary. Specifying last=False with popitem ensures the behaviour is "FIFO", i.e. First-In, First-Out. Here's a trivial example:
from collections import OrderedDict

n = 3
d = OrderedDict()

for i in range(5):
    if len(d) == n:
        removed = d.popitem(last=False)
        print(f'Item removed: {removed}')
    d[i] = i + 1

print(d)
Item removed: (0, 1)
Item removed: (1, 2)
OrderedDict([(2, 3), (3, 4), (4, 5)])
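If you want true LRU semantics (reads also refresh a key's recency) rather than plain FIFO eviction, a sketch along the lines of the OrderedDict recipe in the collections docs might look like this (my adaptation, untested against your workload):

from collections import OrderedDict

class LRUDict(OrderedDict):
    """Keep at most maxsize keys; reads and writes refresh recency."""
    def __init__(self, maxsize, *args, **kwargs):
        self.maxsize = maxsize
        super().__init__(*args, **kwargs)

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)  # mark key as most recently used
        return value

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.move_to_end(key)
        while len(self) > self.maxsize:
            self.popitem(last=False)  # evict the least recently used key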
I have a grid network with a number of nodes at (x, y) coordinates, and I have a couple of individuals that visit these nodes in the network. For instance, individual 1 visits nodes (1,3), (4,5), (8,9) and individual 2 visits (4,3), (2,5).
I need to access these nodes for each individual (say, in a for loop over all individuals), but I do not know the best way of doing it in Python.
You can create a class called Individual to hold all the relevant information for that individual. You can then put those Individual objects in a list or whatever data structure you want.
class Individual:
    def __init__(self, visited):
        self.visited = visited  # type: list[tuple]

    def add_visit(self, node):
        self.visited.append(node)

individuals = [
    Individual([(1, 3), (4, 5), (8, 9)]),
    Individual([(4, 3), (2, 5)])
]

for individual in individuals:
    pass  # do stuff
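For example, the per-individual access you describe might then look like this (just a sketch of the loop body):

for individual in individuals:
    for x, y in individual.visited:
        print(x, y)  # process the node at (x, y)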
Others suggest a class for this task, but I think you would be better off with just a normal dictionary (dict), or defaultdict. No need to create a class for a thing that has no methods and only contains a list of nodes, especially when python has such a fantastic arsenal of containers.
Solution with a dict in Python 3:
individuals = {}
individuals["1"] = [(1, 3), (4, 5), (8, 9)]
individuals["2"] = [(4, 3), (2, 5)]
for ind, node in individuals.items():
print(ind, node)
individuals["2"].append((6, 7))
I have a list of edges. I need to decode a path from source node to sink node from them. There might be loops in my paths, but I should only use each of the edges once. In my list, I might also have the same edge for more than one time, which means in my path I should pass it more than once.
Lets say my edges list as following:
[(1, 16), (9, 3), (8, 9), (15, 8), (5, 1), (8, 15), (3, 5)]
so my path is:
8->15->8->9->3->5->1->16 equivalent to [8,15,8,9,3,5,1,16]
I know the sink node and the source node. (In above sample I knew that 8 is source and 16 is sink) here is another sample with more than one usage of the same edge:
[(1,2),(2,1),(2,3),(1,2)]
the path is:
1->2->1->2->3 equivalent to [1,2,1,2,3]
Basically it is a type of topological sorting, but we don't have loops in topological sorting. I have the following code, but it does not use the nodes in the loops!
def find_all_paths(graph, start, end):
    path = []
    paths = []
    queue = [(start, end, path)]
    while queue:
        start, end, path = queue.pop()
        print 'PATH', path
        path = path + [start]
        if start == end:
            paths.append(path)
        for node in set(graph[start]).difference(path):
            queue.append((node, end, path))
    return paths
Simply, you may need to do more than one pass over the edges to assemble a path using all the edges.
The included code operates on the following assumptions: a solution exists, namely all vertices belong to a single connected component of the underlying graph, and in_degree = out_degree for either all or all but 2 vertices. In the latter case, one of the two vertices has in_degree - out_degree = 1 and the other has in_degree - out_degree = -1.
Furthermore even with these conditions, there is not necessarily a unique solution to the problem of finding a path from source to sink utilizing all edges. This code only finds one solution and not all solutions. (An example where multiple solutions exist is a 'daisy' [(1,2),(2,1),(1,3),(3,1),(1,4),(4,1),(1,5),(5,1)] where the start and end are the same.)
The idea is to create a dictionary of all edges for the path indexed by the starting node for the edge and then remove edges from the dictionary as they are added to the path. Rather than trying to get all of the edges in the path in the first pass, we go over the dictionary multiple times until all of the edges are used. The first pass creates a path from source to sink. Subsequent passes add in loops.
Warning: There is almost no consistency checking or validation. If the start is not a valid source for the edges then the 'path' returned will be disconnected!
"""
This is a basic implementatin of Hierholzer's algorithm as applied to the case of a
directed graph with perhaps multiple identical edges.
"""
import collections
def node_dict(edge_list):
s_dict = collections.defaultdict(list)
for edge in edge_list:
s_dict[edge[0]].append(edge)
return s_dict
def get_a_path(n_dict,start):
"""
INPUT: A dictionary whose keys are nodes 'a' and whose values are lists of
allowed directed edges (a,b) from 'a' to 'b', along with a start WHICH IS
ASSUMED TO BE IN THE DICTIONARY.
OUTPUT: An ordered list of initial nodes and an ordered list of edges
representing a path starting at start and ending when there are no other
allowed edges that can be traversed from the final node in the last edge.
NOTE: This function modifies the dictionary n_dict!
"""
cur_edge = n_dict[start][0]
n_dict[start].remove(cur_edge)
trail = [cur_edge[0]]
path = [cur_edge]
cur_node = cur_edge[1]
while len(n_dict[cur_node]) > 0:
cur_edge = n_dict[cur_node][0]
n_dict[cur_node].remove(cur_edge)
trail.append(cur_edge[0])
path.append(cur_edge)
cur_node = cur_edge[1]
return trail, path
def find_a_path_with_all_edges(edge_list,start):
"""
INPUT: A list of edges given by ordered pairs (a,b) and a starting node.
OUTPUT: A list of nodes and an associated list of edges representing a path
where each edge is represented once and if the input had a valid Eulerian
trail starting from start, then the lists give a valid path through all of
the edges.
EXAMPLES:
In [2]: find_a_path_with_all_edges([(1,2),(2,1),(2,3),(1,2)],1)
Out[2]: ([1, 2, 1, 2, 3], [(1, 2), (2, 1), (1, 2), (2, 3)])
In [3]: find_a_path_with_all_edges([(1, 16), (9, 3), (8, 9), (15, 8), (5, 1), (8, 15), (3, 5)],8)
Out[3]:
([8, 15, 8, 9, 3, 5, 1, 16],
[(8, 15), (15, 8), (8, 9), (9, 3), (3, 5), (5, 1), (1, 16)])
"""
s_dict = node_dict(edge_list)
trail, path_check = get_a_path(s_dict,start)
#Now add in edges that were missed in the first pass...
while max([len(s_dict[x]) for x in s_dict]) > 0:
#Note: there may be a node in a loop we don't have on trail yet
add_nodes = [x for x in trail if len(s_dict[x])>0]
if len(add_nodes) > 0:
skey = add_nodes[0]
else:
print "INVALID EDGE LIST!!!"
break
temp,ptemp = get_a_path(s_dict,skey)
i = trail.index(skey)
if i == 0:
trail = temp + trail
path_check = ptemp + path_check
else:
trail = trail[:i] + temp + trail[i:]
path_check = path_check[:i] + ptemp + path_check[i:]
#Add the final node to trail.
trail.append(path_check[-1][1])
return trail, path_check
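As a quick sanity check of the 'daisy' case mentioned earlier, tracing the algorithm by hand I'd expect one valid traversal to come back like this:

find_a_path_with_all_edges([(1,2),(2,1),(1,3),(3,1),(1,4),(4,1),(1,5),(5,1)], 1)
# expected: ([1, 2, 1, 3, 1, 4, 1, 5, 1],
#            [(1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (4, 1), (1, 5), (5, 1)])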