Memory utilization in recursive vs. iterative graph traversal in Python

I have looked at some common tools like Heapy to measure how much memory is being utilized by each traversal technique, but I don't know if they are giving me the right results. Here is some code to give the context.
The code simply counts the number of unique nodes in a graph. Two traversal techniques are provided: count_bfs and count_dfs.
import sys
from guppy import hpy

class Graph:
    def __init__(self, key):
        self.key = key          # unique id for a vertex
        self.connections = []
        self.visited = False

def count_bfs(start):
    parents = [start]
    children = []
    count = 0
    while parents:
        for ind in parents:
            if not ind.visited:
                count += 1
                ind.visited = True
                for child in ind.connections:
                    children.append(child)
        parents = children
        children = []
    return count

def count_dfs(start):
    if not start.visited:
        start.visited = True
    else:
        return 0
    n = 1
    for connection in start.connections:
        n += count_dfs(connection)
    return n

def construct(file, s=1):
    """Constructs a Graph using the adjacency matrix given in the file

    :param file: path to the file with the matrix
    :param s: starting node key. Defaults to 1
    :return: start vertex of the graph
    """
    d = {}
    f = open(file, 'rU')
    size = int(f.readline())
    for x in xrange(1, size + 1):
        d[x] = Graph(x)
    start = d[s]
    for i in xrange(0, size):
        l = map(lambda x: int(x), f.readline().split())
        node = l[0]
        for child in l[1:]:
            d[node].connections.append(d[child])
    return start

if __name__ == "__main__":
    s = construct(sys.argv[1])
    #h = hpy()
    print(count_bfs(s))
    #print h.heap()
    s = construct(sys.argv[1])
    #h = hpy()
    print(count_dfs(s))
    #print h.heap()
I want to know by what factor the total memory utilization differs between the two traversal techniques, count_dfs and count_bfs. One might have the intuition that DFS is more expensive, since a new stack frame is created for every function call. How can the total memory allocation in each traversal technique be measured?
Do the (commented-out) hpy statements give the desired measure?
Sample file with connections:
4
1 2 3
2 1 3
3 4
4

This being a Python question, it may be more important how much stack space is used than how much total memory. CPython has a low limit of 1000 frames because it shares its call stack with the C call stack, which in turn is limited to the order of one megabyte in most places. For this reason you should almost* always prefer iterative solutions to recursive ones when the recursion depth is unbounded.
* Other implementations of Python may not have this restriction; the stackless variants of CPython and PyPy in particular do not.
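To see the limit concretely, a few lines at the interpreter are enough (a quick sketch; the exception is RuntimeError on Python 2 and RecursionError on Python 3):

import sys

print(sys.getrecursionlimit())   # typically 1000 on CPython

def recurse(n):
    # Each call adds a stack frame; one frame per graph node is exactly what count_dfs does
    return 0 if n == 0 else 1 + recurse(n - 1)

recurse(10000)   # raises "maximum recursion depth exceeded"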

For your specific problem, I don't know if there's going to be an easy solution. That's because the peak memory usage of a graph traversal depends on the details of the graph itself.
For a depth-first traversal, the greatest usage will come when the algorithm has gone to the deepest depth. In your example graph, it will traverse 1->2->3->4, and create a stack frame for each level. So while it is at 4 it has allocated the most memory.
For the breadth-first traversal, the memory used will be proportional to the number of nodes at each depth plus the number of child nodes at the next depth. Those values are stored in lists, which are probably more efficient than stack frames. In the example, since the first node is connected to all the others, it happens immediately during the first step [1]->[2,3,4].
I'm sure there are some graphs that will do much better with one search or the other.
For example, imagine a graph that looked like a linked list, with all the vertices in a single long chain. The depth-first traversal will have a very-high peak memory useage, since it will recurse all the way down the chain, allocating a stack frame for each level. The breadth-first traversal will use much less memory, since it will only have a single vertex with a single child to keep track of on each step.
Now, contrast that with a graph that is a depth-2 tree. That is, there's a single root element that is connected to a great many children, none of which are connected to each other. The depth-first traversal will not use much memory at any given time, as it will only need to traverse two nodes before it has to back up and try another branch. The breadth-first traversal, on the other hand, will be putting all of the child nodes in memory at once, which for a big tree could be problematic.
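If you want to experiment with those two extremes, here is a quick sketch that writes them out in the question's input format (first line is the node count, then each line lists a node followed by its children); write_chain and write_star are illustrative helper names, not part of the original code:

def write_chain(path, n):
    # n-node chain: 1 -> 2 -> 3 -> ... -> n
    with open(path, "w") as f:
        f.write("%d\n" % n)
        for i in range(1, n + 1):
            children = [i + 1] if i < n else []
            f.write(" ".join(map(str, [i] + children)) + "\n")

def write_star(path, n):
    # n-node star: node 1 connected to every other node
    with open(path, "w") as f:
        f.write("%d\n" % n)
        f.write(" ".join(map(str, range(1, n + 1))) + "\n")
        for i in range(2, n + 1):
            f.write("%d\n" % i)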
Your current profiling code won't find the peak memory usage you want, because it only finds the memory used by objects on the heap at the time you call heap(). That's likely to be the same before and after your traversals. Instead, you'll need to insert profiling code into the traversal functions themselves. I can't find a pre-built package of guppy to try it myself, but I think this untested code will work:
from guppy import hpy

def count_bfs(start):
    hp = hpy()
    base_mem = hp.heap().size
    max_mem = 0
    parents = [start]
    children = []
    count = 0
    while parents:
        for ind in parents:
            if not ind.visited:
                count += 1
                ind.visited = True
                for child in ind.connections:
                    children.append(child)
        # measure the heap once per level, when the children list is at its fullest
        mem = hp.heap().size - base_mem
        if mem > max_mem:
            max_mem = mem
        parents = children
        children = []
    return count, max_mem

def count_dfs(start, hp=hpy(), base_mem=None):
    if base_mem is None:
        base_mem = hp.heap().size
    if not start.visited:
        start.visited = True
    else:
        return 0, hp.heap().size - base_mem
    n = 1
    max_mem = 0
    for connection in start.connections:
        c, mem = count_dfs(connection, hp, base_mem)
        if mem > max_mem:
            max_mem = mem
        n += c
    return n, max_mem
Both traversal functions now return a (count, max-memory-used) tuple. You can try them out on a variety of graphs to see what the differences are.
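For example, an untested driver along these lines, building a chain directly with the question's Graph class, would let you compare the two figures side by side (make_chain is an illustrative helper, not part of the original code):

def make_chain(n):
    nodes = [Graph(i) for i in range(1, n + 1)]
    for a, b in zip(nodes, nodes[1:]):
        a.connections.append(b)      # each node points to the next, forming a chain
    return nodes[0]

count, peak = count_bfs(make_chain(100))
print("BFS: %d nodes, peak extra heap %d bytes" % (count, peak))
count, peak = count_dfs(make_chain(100))
print("DFS: %d nodes, peak extra heap %d bytes" % (count, peak))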

It's tough to measure exactly how much memory is being used because systems vary in how they implement stack frames. Generally speaking, recursive algorithms use far more memory than iterative algorithms because each stack frame must store the state of its variables for every function call. Consider the difference between dynamic programming solutions and plain recursive solutions: runtime is usually far faster for an iterative implementation of an algorithm than a recursive one.
If you really must know how much memory your code uses, load your software in a debugger such as OllyDbg (http://www.ollydbg.de/) and count the bytes. Happy coding!
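If you can run the code under Python 3.4 or later, a lighter-weight option (a sketch, not what the answer above describes) is the standard-library tracemalloc module, which reports the peak size of allocations made through Python's allocator between two points:

import tracemalloc

tracemalloc.start()
count = count_bfs(s)                        # or count_dfs(s)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(count, "nodes; peak traced memory:", peak, "bytes")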

Of the two, depth-first uses less memory if most traversals end up hitting most of the graph.
Breadth-first can be better when the target is near the starting node, or when the number of nodes doesn't go up very quickly so the parents/children arrays in your code stay small (e.g. another answer mentioned linked list as worst-case for DFS).
If the graph you're searching is spatial data, or has what's known as an "admissible heuristic," A* is another algorithm that's pretty good: http://en.wikipedia.org/wiki/A_star
However, premature optimization is the root of all evil. Look at the actual data you want to use; if it fits in a reasonable amount of memory, and the search runs in a reasonable time, it doesn't matter which algorithm you use. NB, what's "reasonable" depends on the application you're using it for and the amount of resources on the hardware that will be running it.

For either search order implemented iteratively with the standard data structure describing it (queue for BFS, stack for DFS), I can construct a graph that uses O(n) memory trivially. For BFS, it's an n-star, and for DFS it's an n-chain. I don't believe either of them can be implemented for the general case to do better than that, so that also gives an Omega(n) lower bound on maximum memory usage. So, with efficient implementations of each, it should generally be a wash.
Now, if your input graphs have some characteristics that bias them more toward one of those extremes or the other, that might inform your decision on which to use in practice.

Related

Shortest path between two nodes with fixed number of nodes in path

I have a weighted graph with around 800 nodes, each with a number of connections ranging from 1 to around 300. I need to find the shortest (lowest cost) path between two nodes with some extra criteria:
The path must contain exactly five nodes.
Each node has an attribute (called position in the example code) that takes one of five values; the five nodes in the path must all have unique values for this attribute.
The algorithm needs to allow for 1-2 required nodes to be specified that the path must contain at some point in any order.
The algorithm needs to take less than 10 seconds to run, preferably as little time as possible while losing as little accuracy as possible.
My current solution in Python is to run a Depth-Limited Depth-First Search which recursively searches every possible path. To make this algorithm run in reasonable time I have introduced a limit to the number of neighbour nodes that are searched at each recursion level. This number can be lowered to decrease the computation time but at the cost of accuracy. Currently this algorithm is far too slow, with my most recent test coming in at 75 seconds with a neighbour limit of 30. If I decrease this neighbour limit any more, my testing shows that the accuracy of the algorithm begins to suffer badly. I am out of ideas on how to solve this problem while satisfying all of the above criteria. My code is as follows:
# The path must go from start -> end, be of length 5 and contain all nodes in middle
# Each node is represented as a tuple: (value, position)
def dfs(start, end, middle=[], path=Path(), best=Path([], math.inf)):
    # If this is the first level of recursion, initialise the path variable
    if len(path) == 0:
        path = Path([start])
    # If the max depth has been exceeded, check if the current node is the goal node
    if len(path) >= depth:
        # If it is, save the path
        # Check that all required nodes have been visited
        if len(path) == depth and start == end and path.cost < best.cost and all(x in path.path for x in middle):
            # Store the new best path
            best.path = path.path
            best.cost = path.cost
        return
    # Run DFS on all of the neighbors of the node that haven't been searched already
    # Use the weights of the neighbors as a heuristic; sort by lowest weight first
    neighbors = sorted([x for x in graph.get(*start).connected_nodes], key=lambda x: graph.weight(start, x))
    # Make sure that all neighbors haven't been visited yet and that their positions aren't already accounted for
    positions = [x[1] for x in path.path]
    # Only visit neighbouring nodes with unique positions and ids
    filtered = [x for x in neighbors if x not in path.path and x[1] not in positions]
    for neighbor in filtered[:neighbor_limit]:
        if neighbor not in path.path:
            dfs(neighbor, end, middle, Path(path.path + [neighbor], path.cost + graph.weight(start, neighbor)), best)
    return best
Path Class:
class Path:
    def __init__(self, path=[], cost=0):
        self.path = path
        self.cost = cost

    def __len__(self):
        return len(self.path)
Any help in improving this algorithm or even suggestions on a better approach to the problem would be much appreciated, thanks in advance!
You should iterate over all possible orderings of the 'position' attribute, and for each one use Dijkstra's algorithm or BFS to find the shortest path that respects that ordering.
Since you know the position of the first and last nodes, there are only 3! = 6 different orderings for the intermediate nodes, so you only have to run Dijkstra's algorithm 6 times.
Even in python, this shouldn't take more than a couple hundred milliseconds to run, based on the input sizes you provided.
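A rough, untested sketch of that idea, reusing the question's graph.get(*node).connected_nodes and graph.weight(u, v) helpers, and leaving out the 1-2 required middle nodes (which would need an extra filter on the accepted paths):

import heapq
import itertools
import math

def best_five_node_path(graph, start, end, all_positions):
    # Positions the three intermediate nodes must take, in some order
    remaining = [p for p in all_positions if p not in (start[1], end[1])]
    best = math.inf
    for order in itertools.permutations(remaining):       # 3! = 6 orderings
        layers = (start[1],) + order + (end[1],)           # required position at each depth
        # Dijkstra over states (node, depth); only edges whose endpoint matches the
        # next layer's position are allowed, so every accepted path has exactly 5 nodes
        dist = {(start, 0): 0}
        heap = [(0, start, 0)]
        while heap:
            d, node, depth = heapq.heappop(heap)
            if d > dist[(node, depth)]:
                continue                                    # stale heap entry
            if depth == 4:
                if node == end:
                    best = min(best, d)
                continue
            for nxt in graph.get(*node).connected_nodes:
                if nxt[1] != layers[depth + 1]:
                    continue
                nd = d + graph.weight(node, nxt)
                if nd < dist.get((nxt, depth + 1), math.inf):
                    dist[(nxt, depth + 1)] = nd
                    heapq.heappush(heap, (nd, nxt, depth + 1))
    return best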

Systolic Array Simulation in Python

I am trying to simulate a systolic array structure -- all of which I learned from these slides: http://web.cecs.pdx.edu/~mperkows/temp/May22/0020.Matrix-multiplication-systolic.pdf -- for matrix multiplication in a Python environment. An integral part of a systolic array is that data flow between PE's is concurrent with any multiplication or addition that occurs on any one node. I am having difficulty surmising how exactly to implement such a concurrent procedure in Python. Specifically, I hope to better understand a computational approach to feed the elements of the matrices to be multiplied into the systolic array in a cascading fashion, while allowing these elements to propagate through the array in a concurrent fashion.
I have begun writing some code in Python to multiply two 3-by-3 arrays, but ultimately I want to simulate any sized systolic array, working with any sized a and b matrices.
import numpy as np
from threading import Thread
from collections import deque

vals_deque = deque(maxlen=9*2)  # will hold the interconnections between nodes of the systolic array
dump = deque(maxlen=9)          # will be the output of the SystolicArray
prev_size = 0

def setupSystolicArray():
    global SystolicArray
    SystolicArray = [[NodeSystolic(i, j) for j in range(3)] for i in range(3)]

def spreadInputs(a, b):
    # needs some way to initially propagate the elements of a and b through the top and leftmost parts of the systolic array
    new = map(lambda x: x.start(), SystolicArray)  # start all the nodes of the systolic array, they are waiting for an input
    # even if I found a way to put these inputs into the array at the start, I am not sure how to coordinate future inputs into the array in the cascading fashion described in the slides
    while len(dump) != 9:
        if len(vals_deque) != prev_size:
            vals = vals_deque[-1]
            row = vals['t'][0]
            col = vals['l'][0]
            a = vals['t'][1]
            b = vals['l'][1]
            # these if/elif statements track whether the outputs are at the edge of the systolic array and can be removed
            if row >= 3:
                dump.append(a)
            elif col >= 3:
                dump.append(b)
            else:
                # something is wrong with the logic here
                SystolicArray[row][col - 1].update(a, b)
                SystolicArray[row - 1][col].update(a, b)

class NodeSystolic:
    def __init__(self, row, col):
        self.row = row
        self.col = col
        self.currval = 0
        self.up = False
        self.ain = 0   # coming in from the top
        self.bin = 0   # coming in from the side

    def start(self):
        Thread(target=self.continuous, args=()).start()

    def continuous(self):
        while True:
            if self.up:
                self.currval = self.ain * self.bin
                self.up = False
                self.passon(self.ain, self.bin)
            else:
                pass

    def update(self, left, top):
        self.up = True
        self.ain = top
        self.bin = left

    def passon(self, t, l):
        # this will pass on the inputs this node has received onto the next nodes
        vals_deque.append([{'t': [self.row + 1, self.ain], 'l': [self.col + 1, self.bin]}])

    def returnValue(self):
        return self.currval

def main():
    a = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
    ])
    b = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])
    setupSystolicArray()
    spreadInputs(a, b)
The above code is not operational and still has many bugs. I was hoping someone could give me pointers on how to improve the code, or whether there is a much simpler way to model the parallel, asynchronous behaviour of a systolic array in Python, so that with very large systolic array sizes I won't have to worry about creating too many threads (nodes).
It's interesting to think about simulating a systolic array in Python, but I think there are some significant difficulties in doing this along the lines you've sketched out above.
Most importantly there are the issues about Python's limited scope for true parallelism caused by the Global Interpreter Lock. This means that you won't get any significant parallelism for compute-limited tasks, and its threads are probably best suited to handling I/O limited tasks such as web-requests or filesystem accesses. The nearest Python can get to this is probably via the multiprocessing module, but that would require separate process for each node.
Secondly, even if you were going to get parallelism in the numerical operations within your systolic array, you'd need to have some locking mechanisms to allow different threads to exchange data (or messages) without corrupting each other's memory when they try to read and write data at the same time.
As regards the data structures in your example, I think you might be better off having each node in the systolic array hold a reference to its upstream nodes, rather than knowing that it lies at a particular location in an NxM grid. I don't think there's any reason why a systolic array needs to be a rectangular grid, and any form of Directed Acyclic Graph (DAG) would still have the potential for efficient distributed computation.
Overall, I'd expect the computational overheads of doing this simulation in Python to be enormous relative to what could be achieved by lower-level languages such as Scala or C++. Even then, unless each node in the systolic array is doing a lot of computation (i.e. much more than a few multiply-adds), then the overheads of exchanging messages between nodes will be substantial. So, I presume your simulation is mainly to get an understanding of the data flows, and the high-level behaviour of the array, rather than to get anywhere close to what could be provided by custom DSP (Digital Signal Processing) hardware. If that's the case, then I'd be tempted just to do without the threading and use a centralized message-queue to which all nodes submit messages that are delivered by a global message-distribution mechanism.
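To make that last suggestion concrete, here is a small single-threaded sketch that advances the whole array one clock tick at a time instead of using threads: every cycle each PE multiplies the a/b values sitting at its inputs, accumulates the product, and the values shift one PE right/down. The helper name and the skewed injection schedule are mine, not from the question:

import numpy as np

def systolic_matmul(A, B):
    n = A.shape[0]
    acc = np.zeros((n, n))            # running partial sum held by each PE
    a_reg = np.zeros((n, n))          # a-values flowing rightwards
    b_reg = np.zeros((n, n))          # b-values flowing downwards
    for step in range(3 * n - 2):     # enough cycles for everything to flow through
        # Shift: data moves one PE right (a) and one PE down (b) per cycle
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Inject skewed (cascading) inputs at the left and top edges
        for i in range(n):
            k = step - i
            a_reg[i, 0] = A[i, k] if 0 <= k < n else 0
            b_reg[0, i] = B[k, i] if 0 <= k < n else 0
        acc += a_reg * b_reg          # every PE multiply-accumulates "in parallel"
    return acc

A = np.arange(1, 10).reshape(3, 3)
B = np.arange(1, 10).reshape(3, 3)
print(systolic_matmul(A, B))          # matches A.dot(B)

The same stepping loop could also be driven by a central message queue if you want to keep explicit node objects around.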

Is this code fragment a genetic algorithm?

I am trying to learn genetic algorithms and AI development and I copied this code from a book, but I don't know if it is a proper genetic algorithm.
Here is the code (main.py):
import random

geneSet = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!.,1234567890-_=+!##$%^&*():'[]\""
target = input()

def generate_parent(length):
    genes = []
    while len(genes) < length:
        sampleSize = min(length - len(genes), len(geneSet))
        genes.extend(random.sample(geneSet, sampleSize))
    parent = ""
    for i in genes:
        parent += i
    return parent

def get_fitness(guess):
    total = 0
    for i in range(len(target)):
        if target[i] == guess[i]:
            total = total + 1
    return total
    """
    return sum(1 for expected, actual in zip(target, guess)
               if expected == actual)
    """

def mutate(parent):
    index = random.randrange(0, len(parent))
    childGenes = list(parent)
    newGene, alternate = random.sample(geneSet, 2)
    if newGene == childGenes[index]:
        childGenes[index] = alternate
    else:
        childGenes[index] = newGene
    child = ""
    for i in childGenes:
        child += i
    return child

random.seed()
bestParent = generate_parent(len(target))
bestFitness = get_fitness(bestParent)
print(bestParent)

while True:
    child = mutate(bestParent)
    childFitness = get_fitness(child)
    if bestFitness >= childFitness:
        continue
    print(str(child) + "\t" + str(get_fitness(child)))
    if childFitness >= len(bestParent):
        break
    bestFitness = childFitness
    bestParent = child
I saw that it has the fitness function and the mutate function, but it doesn't generate a population and I don't understand why. I thought that a genetic algorithm needs a population generation and a crossover from the best population members to the new generation. Is this a proper genetic algorithm?
Although there are a lot of ambiguous definitions in the field of AI, my understanding is that:
An evolutionary algorithm (EA) is an algorithm that has a (set of) solution(s) and, by mutating them somehow (crossover is here also seen as "mutating"), eventually ends up with better solution(s).
A genetic algorithm (GA) supports the concept of a crossover, where two or more "solutions" produce new solutions (see the short crossover sketch further below).
But the terms are sometimes mixed. Mind, however, that crossover is definitely not the only way to produce new individuals; there are more approaches than genetic algorithms for producing better solutions, like:
Simulated Annealing (SA);
Tabu Search (TS);
...
But as said earlier there is always a lot of discussion what the terms really mean and most papers on probabilistic combinatorial optimization state clearly what they mean with the terms.
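For reference, the distinguishing ingredient of a GA is crossover; a minimal one-point crossover for string "genomes" like the ones in the question could look like this (a sketch only):

import random

def crossover(parent_a, parent_b):
    # Cut both parents at the same random point and splice the pieces together
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]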
So according to the above definition, your program is an evolutionary algorithm, but not a genetic one: it always has a population of one after each iteration. Furthermore your program only accepts a new child if it is better than its parent making it a Local Search (LS) algorithm. The problem with local search algorithms is that - if the mutation space of some/all solutions is a subset of the solution space - local search algorithms can get stuck forever in a local optimum. Even if that is not the case, they can get stuck in a local optimum for a very long time.
Here that is not a problem since there are no local optima (but this is of course an easy problem). More hard (and interesting) problems usually have (a lot) of local optima.
Local Search is not a bad technique if it collaborates with other techniques that help get the system out of the local optimum again. Other evolutionary techniques such as simulated annealing will accept a worse solution with small probability (depending how much worse the solution is, and how far we are in the evolutionary process).
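For illustration, the acceptance rule of simulated annealing could be bolted onto the question's mutate/get_fitness functions roughly like this (a sketch; the cooling schedule and constants are arbitrary):

import math
import random

def anneal(start, steps=100000, t_start=5.0, t_end=0.01):
    current = start
    current_fitness = get_fitness(current)
    for step in range(steps):
        # Exponential cooling from t_start down to t_end
        t = t_start * (t_end / t_start) ** (step / float(steps))
        candidate = mutate(current)
        delta = get_fitness(candidate) - current_fitness
        # Always accept improvements; accept worse candidates with probability exp(delta / t)
        if delta >= 0 or random.random() < math.exp(delta / t):
            current, current_fitness = candidate, current_fitness + delta
        if current_fitness >= len(target):
            break
    return current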

Reducing memory usage for dynamic programming implementation of TSP

I'm trying to implement dynamic programming to solve TSP for a set of 25 cities.
With the straightforward implementation I'm facing a 'memory error'. I'm trying to solve it by taking into account that my implementation needs only subsets which are of size one less than the current subsets, i.e. the sub-problem in dynamic programming is the current set with one element removed. Hence only the distances corresponding to the current and previous sizes (sizes vary from 2 to 25) are stored. Still, the program eats almost 85% of memory and then throws the error. My system has 4GB of RAM.
How can I fix the code so that its memory usage doesn't explode?
from itertools import combinations as comb  # assumed: comb enumerates the subsets below

def make_set(i):
    '''Generates all subsets of size i (integer)
    from cities, including the element 1'''
    cities = range(2, 26)
    subs = comb(cities, i)
    for j in subs:
        j = set(j)
        j.add(1)
        yield j

def tsp(cities, dist):
    '''Performs search on subsets to find shortest path that traverses
    all cities starting and ending at 1.
    Returns a dictionary subset_dist, which needs to be processed
    further to find the answer'''
    subset_dist = {}
    subset_dist[(frozenset([1]), 1)] = 0
    for i in range(1, 25):
        subset_dist_new = {}
        subsets = make_set(i)  # generator for subsets of size i
        for S in subsets:
            for j in S:  # j is the destination
                if j != 1:
                    min_dist = float('inf')
                    S_minusJ = set(S)
                    S_minusJ.remove(j)
                    for k in S:  # k defines the subproblem, the closest city before j, for the current subset
                        if k == 1 and len(S) > 2:
                            continue
                        if k != j:
                            min_dist = min(min_dist, subset_dist[(frozenset(S_minusJ), k)] + dist[frozenset([j, k])])
                    subset_dist_new[(frozenset(S), j)] = min_dist
        subset_dist = subset_dist_new
    return subset_dist_new

Finding components of very large graph

I have a very large graph represented in a text file of size about 1TB with each edge as follows.
From-node to-node
I would like to split it into its weakly connected components. If it was smaller I could load it into networkx and run their component finding algorithms. For example
http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.components.connected.connected_components.html#networkx.algorithms.components.connected.connected_components
Is there any way to do this without loading the whole thing into memory?
If you have few enough nodes (e.g. a few hundred million), then you could compute the connected components with a single pass through the text file by using a disjoint set forest stored in memory.
This data structure only stores the rank and parent pointer for each node so should fit in memory if you have few enough nodes.
For larger numbers of nodes, you could try the same idea, but storing the data structure on disk (and possibly improved by using a cache in memory to store frequently used items).
Here is some Python code that implements a simple in-memory version of disjoint set forests:
N = 7  # Number of nodes
rank = [0] * N
parent = range(N)

def Find(x):
    """Find representative of connected component"""
    if parent[x] != x:
        parent[x] = Find(parent[x])
    return parent[x]

def Union(x, y):
    """Merge sets containing elements x and y"""
    x = Find(x)
    y = Find(y)
    if x == y:
        return
    if rank[x] < rank[y]:
        parent[x] = y
    elif rank[x] > rank[y]:
        parent[y] = x
    else:
        parent[y] = x
        rank[x] += 1

with open("disjointset.txt", "r") as fd:
    for line in fd:
        fr, to = map(int, line.split())
        Union(fr, to)

for n in range(N):
    print n, 'is in component', Find(n)
If you apply it to the text file called disjointset.txt containing:
1 2
3 4
4 5
0 5
it prints
0 is in component 3
1 is in component 1
2 is in component 1
3 is in component 3
4 is in component 3
5 is in component 3
6 is in component 6
You could save memory by not using the rank array, at the cost of potentially increased computation time.
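The rank-free variant would simply be the following, relying on path compression in Find to keep the trees shallow in practice:

def Union(x, y):
    parent[Find(x)] = Find(y)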
If even the number of nodes is too large to fit in memory, you can divide and conquer and use external memory sorts to do most of the work for you (e.g. the sort command included with Windows and Unix can sort files much larger than memory):
Choose some threshold vertex k.
Read the original file and write each of its edges to one of 3 files:
To a if its maximum-numbered vertex is < k
To b if its minimum-numbered vertex is >= k
To c otherwise (i.e. if it has one vertex < k and one vertex >= k). (A short sketch of this partitioning step appears after the list.)
If a is small enough to solve (find connected components for) in memory (using e.g. Peter de Rivaz's algorithm) then do so, otherwise recurse to solve it. The solution should be a file whose lines each consist of two numbers x y and which is sorted by x. Each x is a vertex number and y is its representative -- the lowest-numbered vertex in the same component as x.
Do likewise for b.
Sort edges in c by their smallest-numbered endpoint.
Go through each edge in c, renaming the endpoint that is < k (remember, there must be exactly one such endpoint) to its representative, found from the solution to the subproblem a. This can be done efficiently by using a linear-time merge algorithm to merge with the solution to the subproblem a. Call the resulting file d.
Sort edges in d by their largest-numbered endpoint. (The fact that we have already renamed the smallest-numbered endpoint doesn't make this unsafe, since renaming can never increase a vertex's number.)
Go through each edge in d, renaming the endpoint that is >= k to its representative, found from the solution to the subproblem b using a linear-time merge as before. Call the resulting file e.
Solve e. (As with a and b, do this directly in memory if possible, otherwise recurse. If you need to recurse, you will need to find a different way of partitioning the edges, since all the edges in e already "straddle" k. You could for example renumber vertices using a random permutation of vertex numbers, recurse to solve the resulting problem, then rename them back.) This step is necessary because there could be an edge (1, k), another edge (2, k+1) and a third edge (2, k), and this will mean that all vertices in the components 1, 2, k and k+1 need to be combined into a single component.
Go through each line in the solution for subproblem a, updating the representative for this vertex using the solution to subproblem e if necessary. This can be done efficiently using a linear-time merge. Write out the new list of representatives (which will already be sorted by vertex number due to the fact that we created it from a's solution) to a file f.
Do likewise for each line in the solution for subproblem b, creating file g.
Concatenate f and g to produce the final answer. (For better efficiency, just have step 11 append its results directly to f).
All the linear-time merge operations used above can read directly from disk files, since they only ever access items from each list in increasing order (i.e. no slow random access is needed).
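For step 2, the partitioning itself is a simple single pass over the edge file; a short sketch (file names are placeholders):

def partition_edges(edge_file, k):
    with open(edge_file) as src, \
         open("a.txt", "w") as fa, \
         open("b.txt", "w") as fb, \
         open("c.txt", "w") as fc:
        for line in src:
            u, v = map(int, line.split())
            if max(u, v) < k:
                fa.write(line)      # both endpoints below the threshold
            elif min(u, v) >= k:
                fb.write(line)      # both endpoints at or above the threshold
            else:
                fc.write(line)      # edge straddles the threshold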
External memory graph traversal is tricky to get performant. I advise against writing your own code; implementation details make the difference between a runtime of a few hours and a runtime of a few months. You should consider using an existing library like stxxl. See here for a paper using it to compute connected components.
