Systolic Array Simulation in Python

I am trying to simulate a systolic array structure -- all of which I learned from these slides: -- for matrix multiplication in a Python environment. An integral part of a systolic array is that data flow between PEs (processing elements) is concurrent with any multiplication or addition that occurs on any one node. I am having difficulty working out how exactly to implement such a concurrent procedure in Python. Specifically, I hope to better understand a computational approach for feeding the elements of the matrices to be multiplied into the systolic array in a cascading fashion, while allowing these elements to propagate through the array concurrently.
I have begun writing some code in Python to multiply two 3 by 3 arrays, but ultimately I want to simulate a systolic array of any size that works with a and b matrices of any size.
from threading import Thread
from collections import deque
import numpy as np

vals_deque = deque(maxlen=9*2)  # will hold the interconnections between nodes of the systolic array
dump = deque(maxlen=9)  # will be the output of the SystolicArray
prev_size = 0

def setupSystolicArray():
    global SystolicArray
    SystolicArray = [NodeSystolic(i, j) for i in range(3) for j in range(3)]

def spreadInputs(a, b):
    # needs some way to initially propagate the elements of a and b through the top and leftmost parts of the systolic array
    new = map(lambda x: x.start(), SystolicArray)  # start all the nodes of the systolic array, they are waiting for an input
    # even if I found a way to put these inputs into the array at the start, I am not sure how to coordinate future inputs into the array in the cascading fashion described in the slides
    while(len(dump) != 9):
        if(len(vals_deque) != prev_size):
            vals = vals_deque[-1]
            row = vals['t'][0]
            col = vals['l'][0]
            a = vals['t'][1]
            b = vals['l'][1]
            # these if elif statements track whether the outputs are at the edge of the systolic array and can be removed
            if(row >= 3):
            elif(col >= 3):
            # something is wrong with the logic here

class NodeSystolic:
    def __init__(self, row, col):
        self.row = row
        self.col = col
        self.currval = 0
        self.up = False
        self.ain = 0  # coming in from the top
        self.bin = 0  # coming in from the side

    def start(self):
        Thread(target=self.continuous, args=()).start()

    def continuous(self):
        while True:
            if(self.up == True):
                self.currval = self.ain*self.bin
                self.up = False
                self.passon(self.ain, self.bin)

    def update(self, left, top):
        self.up = True
        self.ain = top
        self.bin = left

    def passon(self, t, l):
        # this will pass on the inputs this node has received onto the next nodes
        vals_deque.append([{'t': [self.row + 1, self.ain], 'l': [self.col + 1, self.bin]}])

    def returnValue(self):
        return self.currval

def main():
    a = np.array([
    b = np.array([
The above code is not operational and still has many bugs. I was hoping someone could give me pointers on how to improve the code, or whether there is a much simpler way to model the parallel procedures of a systolic array with the asynchronous properties in Python, so with very large systolic array sizes, I won't have to worry about creating too many threads (nodes).

It's interesting to think about simulating a systolic array in Python, but I think there are some significant difficulties in doing this along the lines you've sketched out above.
Most importantly, there is Python's limited scope for true parallelism, caused by the Global Interpreter Lock. This means that you won't get any significant parallelism for compute-limited tasks, and its threads are probably best suited to handling I/O-limited tasks such as web requests or filesystem accesses. The nearest Python can get to this is probably via the multiprocessing module, but that would require a separate process for each node.
Secondly, even if you were going to get parallelism in the numerical operations within your systolic array, you'd need to have some locking mechanisms to allow different threads to exchange data (or messages) without corrupting each other's memory when they try to read and write data at the same time.
As regards the data structures in your example, I think you might be better off having each node in the systolic array hold a reference to its upstream nodes, rather than knowing that it lies at a particular location in an NxM grid. I don't think there's any reason why a systolic array needs to be a rectangular grid, and any form of Directed Acyclic Graph (DAG) would still have the potential for efficient distributed computation.
Overall, I'd expect the computational overheads of doing this simulation in Python to be enormous relative to what could be achieved by lower-level languages such as Scala or C++. Even then, unless each node in the systolic array is doing a lot of computation (i.e. much more than a few multiply-adds), then the overheads of exchanging messages between nodes will be substantial. So, I presume your simulation is mainly to get an understanding of the data flows, and the high-level behaviour of the array, rather than to get anywhere close to what could be provided by custom DSP (Digital Signal Processing) hardware. If that's the case, then I'd be tempted just to do without the threading and use a centralized message-queue to which all nodes submit messages that are delivered by a global message-distribution mechanism.
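As a rough illustration of simulating the data flow without threads, the sketch below advances the whole array one clock tick at a time, which is a simpler relative of the centralized approach suggested above (a global step loop rather than a message queue). It assumes an output-stationary design in which values of a flow left to right, values of b flow top to bottom, and each node accumulates its own output element. The function name systolic_matmul and the register arrays are made up for this example and do not come from the question's code.

```python
import numpy as np

def systolic_matmul(a, b):
    """Clock-stepped sketch of an output-stationary systolic array (assumed design)."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    dtype = np.result_type(a, b)
    acc = np.zeros((n, m), dtype=dtype)    # one accumulator per node
    a_reg = np.zeros((n, m), dtype=dtype)  # a-value currently held by each node
    b_reg = np.zeros((n, m), dtype=dtype)  # b-value currently held by each node
    for t in range(n + m + k - 2):
        # Every node takes the a-value of its left neighbour and the
        # b-value of its upper neighbour, all in the same tick.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Feed the edges in a skewed (cascading) order:
        # row i of a is delayed by i ticks, column j of b by j ticks.
        for i in range(n):
            s = t - i
            a_reg[i, 0] = a[i, s] if 0 <= s < k else 0
        for j in range(m):
            s = t - j
            b_reg[0, j] = b[s, j] if 0 <= s < k else 0
        # All nodes perform their multiply-accumulate for this tick.
        acc += a_reg * b_reg
    return acc
```

At tick t, node (i, j) holds a[i, t-i-j] and b[t-i-j, j], so after n + m + k - 2 ticks every node has accumulated its full dot product. The result can be compared against a @ b for small matrices to check the data flow.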


Efficient Way to Repeatedly Split Large NumPy Array and Record Middle

I have a large NumPy array nodes = np.arange(100_000_000) and I need to rearrange this array by:
1. Record and then remove the middle value in the array
2. Split the array into the left half and right half
3. Repeat Steps 1-2 for each half
4. Stop when all values are exhausted
So, for a smaller input example nodes = np.arange(10), the output would be:
[5 2 8 1 4 7 9 0 3 6]
This was accomplished by naively doing:
import numpy as np

def split(node, out):
    mid = len(node) // 2
    out.append(node[mid])  # record the middle value
    return node[:mid], node[mid+1:]

def reorder(a):
    nodes = [a.tolist()]
    out = []
    while nodes:
        tmp = []
        for node in nodes:
            for n in split(node, out):
                if n:
                    tmp.append(n)  # keep only non-empty halves
        nodes = tmp
    return np.array(out)

if __name__ == "__main__":
    nodes = np.arange(10)
    print(reorder(nodes))
However, this is way too slow for nodes = np.arange(100_000_000) and so I am looking for a much faster solution.
You can vectorize your function with Numpy by working on groups of slices.
Here is an implementation:
# Similar to [e for tmp in zip(a, b) for e in tmp] ,
# but on Numpy arrays and much faster
def interleave(a, b):
    assert len(a) == len(b)
    return np.column_stack((a, b)).reshape(len(a) * 2)

# n is the length of the input range (len(a) in your example)
def fast_reorder(n):
    if n == 0:
        return np.empty(0, dtype=np.int32)
    startSlices = np.array([0], dtype=np.int32)
    endSlices = np.array([n], dtype=np.int32)
    allMidSlices = np.empty(n, dtype=np.int32)  # Similar to "out" in your implementation
    midInsertCount = 0  # Actual size of allMidSlices
    # Generate a bunch of middle values as long as there are valid slices to split
    while midInsertCount < n:
        # Generate the new mid/left/right slices
        midSlices = (endSlices + startSlices) // 2
        # Computing the next slices is not needed for the last step
        if midInsertCount + len(midSlices) < n:
            # Generate the next slices (possibly with invalid ones)
            newStartSlices = interleave(startSlices, midSlices+1)
            newEndSlices = interleave(midSlices, endSlices)
            # Discard invalid slices
            isValidSlices = newStartSlices < newEndSlices
            startSlices = newStartSlices[isValidSlices]
            endSlices = newEndSlices[isValidSlices]
        # Fast appending
        allMidSlices[midInsertCount:midInsertCount+len(midSlices)] = midSlices
        midInsertCount += len(midSlices)
    return allMidSlices[0:midInsertCount]
On my machine, this is 89 times faster than your scalar implementation with the input np.arange(100_000_000), dropping from 2min35 to 1.75s. It also consumes far less memory (roughly 3~4 times less). Note that if you want even faster code, you probably need to use a native language like C or C++.
The question has been updated to have a much smaller input array so I leave the below for historical reasons. Basically it was likely a typo but we often get accustomed to computers working with insanely large numbers and when memory is involved they can be a real problem.
There is already a numpy based solution submitted by someone else that I think fits the bill.
Your code requires an insane amount of RAM just to hold 100 billion 64 bit integers. Do you have 800GB of RAM? Then you convert the numpy array to a list which will be substantially larger than the array (each packed 64 bit int in the numpy array will become a much less memory efficient python int object and the list will have a pointer to that object). Then you make a lot of slices of the list which will not duplicate the data but will duplicate the pointers to the data and use even more RAM. You also append all the result values to a list a single value at a time. Lists are very fast for adding items generally but with such an extreme size this will not only be slow but the way the list is allocated is likely to be extremely wasteful RAM wise and contribute to major problems (I believe they double in size when they get to a certain level of fullness so you will end up allocating more RAM than you need and doing many allocations and likely copies). What kind of machine are you running this on? There are ways to improve your code but unless you're running it on a super computer I don't know that you're going to ever finish that calculation. I only..only? have 32GB of RAM and I'm not going to even try to create a 100B int_64 numpy array as I don't want to use up ssd write life for a mass of virtual memory.
As for improving your code: stick to NumPy arrays and don't convert to a Python list, as that will greatly increase the RAM you need. Preallocate a NumPy array to put the answer in. Then you need a new algorithm. Anything recursive or recursive-like (i.e. a loop splitting the input) will require tracking a lot of state; your nodes list is going to be extraordinarily gigantic and again use a lot of RAM. You could use len(a) as a sentinel to mark values that have been removed from your list and scan through the entire array each time to figure out what to do next, but that saves RAM at the cost of a tremendous amount of searching through a gigantic array. I feel like there is an algorithm that cuts numbers from each end, places them in the output, and just tracks the beginning and end, but I haven't figured it out, at least not yet.
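One way to avoid slicing altogether is to track only (start, end) index pairs and write each midpoint into a preallocated output array. The sketch below is a minimal illustration of that idea, assuming the input is np.arange(n) so that each value equals its own index; reorder_by_index is a name invented here. It is still a Python-level loop, so it reduces memory use rather than making the very large case fast.

```python
from collections import deque

import numpy as np

def reorder_by_index(n):
    """Level-by-level midpoint order for np.arange(n), tracking index pairs only."""
    out = np.empty(n, dtype=np.int64)  # preallocated result
    pending = deque([(0, n)])          # half-open ranges [start, end) still to split
    pos = 0
    while pending:
        start, end = pending.popleft()
        if start >= end:
            continue                   # empty half, nothing to record
        mid = (start + end) // 2       # same element as len(slice) // 2 within the slice
        out[pos] = mid
        pos += 1
        pending.append((start, mid))   # left half
        pending.append((mid + 1, end)) # right half
    return out
```

Because the queue is processed first-in first-out, the midpoints come out level by level, matching the order of the list-based version; for n = 10 this yields 5 2 8 1 4 7 9 0 3 6.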
I also think there is a simpler algorithm where you just track the number of splits you've done instead of making a giant list of slices and keeping it all in memory. Take the middle of the left half and then the middle of the right then count up one and when you take the middle of the left half's left half you know you have to jump to the right half then the count is one so you jump over to the original right half's left half and on and on... Based on the depth into the halves and the length of the input you should be able to jump around without scanning or tracking all of those slices though I haven't been able to dedicate much time to thinking this through in my head.
With a problem of this nature if you really need to push the limits you should consider using C/C++ so you can be as efficient as possible with RAM usage and because you're doing an insane number of tiny things which doesn't map well to python performance.

Improving performance by using multiple threads/cores

I have a game (using pygame) that I want to improve the performance of. I noticed that when I have low fps the game was only using 20% of the CPU at most, is there a way I can use threads to utilize more of the CPU?
I have already tried to implement threads, but without success; some help would be appreciated.
This function is what is causing the lag:
First Version
def SearchFood(self):
    if not self.moving:
        tempArr = np.array([])
        for e in entityArr:
            if type(e) == Food:
                if e.rect != None and self.viewingRect != None:
                    if self.viewingRect.colliderect(e.rect):
                        tempArr = np.append(tempArr, e)
        if tempArr.size > 0:
            self.nearestFood = sorted(tempArr, key=lambda e: Mag((self.x - e.x, self.y - e.y)))[0]
Second Version (Slower)
def SearchFood(self):
    if not self.moving:
        s_arr = sorted(entityArr, key=lambda e: math.hypot(self.x - e.x, self.y - e.y))
        for e, i in enumerate(s_arr):
            if type(e) != Food:
                self.nearestFood = None
            self.nearestFood = s_arr[i]
I look through the entire list of entities, keep those that are food, and sort them by their distance to the thing that wants to eat said food. The problem is that the entity array is 500 elements (or more) long and thus takes a really long time to iterate through and sort. To remedy that, I want to make use of more of the CPU through threading.
Here's the full script if that helps:
In Python, threading does not increase the number of cores used. You must use multiprocessing instead.
The doc :
Multithreading in Python is nearly useless (for CPU-intensive tasks like this), and multiprocessing, while viable, requires expensive marshaling of data between processes or careful design. I don't believe either one is applicable to your case.
However, unless you have a huge amount of objects in your game, you shouldn't need to use multiple cores for your scenario. The issue seems more one of algorithmic complexity.
You can improve the performance of your code in several ways:
Keep an index of entities by type (e.g. a dict from entity-type to set of entities, which you update as entities are created/removed), which would allow you to easily find all the "food" entities without scanning through all entities in the game.
Find the nearest food entity using a simple "min" operation (which is O(n)) instead of sorting all the foods by distance (which is O(n*logn)).
If this is still slow you can apply a culling technique, where you first filter foods to those within an easily-computed range (e.g. a rectangle around the player), then find the nearest one by applying the more expensive distance computation only to those.
Make loops tighter by avoiding checking unnecessary conditions inside them, and whenever possible using builtin selection/creation constructs rather than iterating through large lists of objects.
e.g. you can end up with something like:
def find_nearest_food(self):
    food_entities = self._entities_by_type[Food]
    nearest_food = min(food_entities, key=lambda entity: distance_sq(self, entity))
    return nearest_food

def distance_sq(ent1, ent2):
    # we don't need an expensive square root operation if we're just comparing distances
    dx, dy = (ent1.x - ent2.x), (ent1.y - ent2.y)
    return dx * dx + dy * dy
You can optimize further by keeping entity positions as NumPy vectors instead of separate x and y properties, which would allow you to use NumPy operations to calculate distance, e.g. distance_sq = np.sum((ent1.pos - ent2.pos)**2), or just np.linalg.norm(ent1.pos - ent2.pos) for the regular distance. This might also be useful for other vector arithmetic operations.
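To make the NumPy suggestion concrete, here is a minimal sketch that finds the nearest food in one vectorized pass. It assumes the food positions are kept together in a single (N, 2) array, which is not how the original game stores them; nearest_index and its parameters are hypothetical names.

```python
import numpy as np

def nearest_index(origin, positions):
    """Return the row index of the position closest to origin.

    origin:    array of shape (2,)
    positions: array of shape (N, 2), one row per food entity (assumed layout)
    """
    deltas = positions - origin
    # Squared distances are enough for comparison, so no square root is taken.
    dist_sq = np.einsum("ij,ij->i", deltas, deltas)
    return int(np.argmin(dist_sq))
```

The returned index can then be used to look up the matching entity in a parallel list of food objects.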

Parallelizing array row similarity calculations in python

I have a large-ish array artist_topic_probs (112,312 item rows by ~100 feature columns), and I want to calculate the pairwise cosine similarity between a (large sample) of random pairs of rows from this array. Here's the relevant bits of my current code
# the number of random pairs to check (10 million here)
random_sample_size = 10000000
# I want to make sure they're unique, and that I'm never comparing a row to itself
# so I generate my set of comparisons like so:
comps = set()
while len(comps) < random_sample_size:
    a = np.random.randint(0, 112312)
    b = np.random.randint(0, 112312)
    if a != b:
        comp = tuple(sorted([a, b]))
        comps.add(comp)
# convert to list at the end to ensure sort order
# not positive if this is needed...I've seen conflicting opinions
comps = list(sorted(comps))
This generates a list of tuples, where each are the two rows between which I'll calculate similarity. Then I just use a simple loop to calculate all the similarities:
c_dists = []
from scipy.spatial.distance import cosine
for a, b in comps:
    c_dists.append(cosine(artist_topic_probs[a], artist_topic_probs[b]))
(of course, cosine here gives distance, not a similarity, but we can easily get that with sim = 1.0 - dist. I used similarity in the title because it's the more common term)
This works fine, but isn't too fast, and I need to repeat the procedure many times. I have 32 cores to work with, so parallelization seems like a good bet, but I'm not sure the best way to go about it. My idea was something like:
pool = mp.Pool(processes=32)
c_dists = [pool.apply(cosine, args=(artist_topic_probs[a], artist_topic_probs[b]))
           for a, b in comps]
But testing this approach out on my laptop with some test data hasn't been working (it just hangs, or at least is taking so much longer than the simple loop that I got sick of waiting and killed it). My concern is the indexing of the matrix being some sort of bottleneck, but I'm not sure. Any ideas on how to effectively parallelize this (or otherwise speed up the process)?
First of all, you might want to use itertools.combinations and random.sample to get unique pairs in the future, but it won't work in this case due to memory issues. Then, multiprocessing is not multithreading, i.e. spawning a new process involves huge system overhead. There is little sense in spawning a process for each individual task. A task must be well worth the overhead to rationalise starting a new process, hence you'd better split all work into separate jobs (into as many pieces as the number of cores you want to use). Then, don't forget that multiprocessing implementation serialises the entire namespace and loads it into memory N times, where N is the number of processes. This can lead to intensive swapping if you don't have enough RAM to store N copies of your huge array. So you might want to reduce the number of cores.
Updated to restore initial order as you requested.
I made a test data-set of identical vectors, hence cosine must return a vector of zeros.
from __future__ import division, print_function
import math
import multiprocessing as mp
from scipy.spatial.distance import cosine
from operator import itemgetter
import itertools

def worker(enumerated_comps):
    return [(ind, cosine(artist_topic_probs[a], artist_topic_probs[b])) for ind, (a, b) in enumerated_comps]

def slice_iterable(iterable, chunk):
    """
    Slices an iterable into chunks of size chunk
    :param chunk: the number of items per slice
    :type chunk: int
    :type iterable: collections.Iterable
    :rtype: collections.Generator
    """
    _it = iter(iterable)
    return itertools.takewhile(
        bool, (tuple(itertools.islice(_it, chunk)) for _ in itertools.count(0))
    )

# Test data
artist_topic_probs = [range(10) for _ in range(10)]
comps = tuple(enumerate([(1, 2), (1, 3), (1, 4), (1, 5)]))
n_cores = 2
chunksize = int(math.ceil(len(comps)/n_cores))
jobs = tuple(slice_iterable(comps, chunksize))
pool = mp.Pool(processes=n_cores)
work_res = pool.map_async(worker, jobs)
# sorting by the enumeration index restores the initial order
c_dists = list(map(itemgetter(1), sorted(itertools.chain(*work_res.get()))))
print(c_dists)
[2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16]
These values are fairly close to zero.
From the multiprocessing.Pool.apply docs
Equivalent of the apply() built-in function. It blocks until the
result is ready, so apply_async() is better suited for performing
work in parallel. Additionally, func is only executed in one of the
workers of the pool.
scipy.spatial.distance.cosine, as you can see following the link, introduces a significant overhead in your computations, because each invocation computes the norms of the two vectors you are analyzing. For the size of your sample this amounts to 20 million norms computed. If you precompute and store the norms of your ~100 thousand vectors in advance, you can save approximately 60% of your computation time, because each call consists of a dot product, u·v, and two norm calculations, and each of these three operations is roughly equivalent in terms of operation count.
Further, you're using explicit loops; if you could put your logic inside a vectorized NumPy operation, you could trim another large slice of your computational time.
Finally, you talk about cosine similarity... note that scipy.spatial.distance.cosine computes the cosine distance instead. The relationship is simple, cs = 1 - cd, but I haven't seen this conversion in your posted code.
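The two suggestions above (compute each norm once, and replace the Python loop with array operations) can be combined roughly as follows. This is a sketch under assumptions, not the poster's code: cosine_similarities and the chunk parameter are invented names, the rows are assumed to be non-zero vectors, and pairs is assumed to be a sequence of (row, row) index pairs like comps in the question.

```python
import numpy as np

def cosine_similarities(data, pairs, chunk=100_000):
    """Cosine similarity for each (a, b) row pair, computed in chunks."""
    data = np.asarray(data, dtype=float)
    pairs = np.asarray(pairs)
    # Normalise every row once, so each norm is computed a single time.
    unit = data / np.linalg.norm(data, axis=1, keepdims=True)
    sims = np.empty(len(pairs))
    for start in range(0, len(pairs), chunk):
        block = pairs[start:start + chunk]
        # Row-wise dot products of the selected unit vectors.
        sims[start:start + chunk] = np.einsum(
            "ij,ij->i", unit[block[:, 0]], unit[block[:, 1]]
        )
    return sims
```

Processing the pairs in chunks keeps the temporary arrays small; with 10 million pairs and about 100 columns, gathering all rows at once would need several gigabytes. The cosine distance, if needed, is 1 minus each returned value.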

Python - Multi processing to mount an array

I'm using griddata to "mount" an array with a great number of shapes, and I would like to know if I can calculate the functions (one per slice) on each of my 4 cores in order to accelerate the process?
import numpy
size = 8.
array[:,:,0] = 0
array[:,:,1] = X+Y
array[:,:,2] = X**2+Y**2+X+Y
array[:,:,3] = X**3+Y**3+X**2+Y**2+X+Y
array[:,:,4] = X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,5] = X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,6] = X**6+Y**6+X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,7] = X**7+Y**7+X**6+Y**6+X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
So here I would like to calculate array[:,:,0] & array[:,:,1] with the first core, then array[:,:,2] & array[:,:,3] with the second core...?
There is no link between different "slices"...My different functions are independent
array[:,:,0] = 0
array[:,:,1] = X+Y
array[:,:,2] = X*np.cos(X)+Y*np.sin(Y)
array[:,:,3] = X**3+np.sin(X)+X**2+Y**2+np.sin(Y)
You can try with multiprocessing.Pool :
from multiprocessing import Pool
import numpy as np
size = 8.
def func(i): # you need to call a function with Pool
for j in range(1,i):
return array_
if __name__ == '__main__':
p = Pool(4) # if you have 4 cores in your processor, range(1,8))
for i in range(1,8):
Keep in mind that multiprocessing in python does not share memory, that's why you have to create the array_ and add the for-loop at the end of the code.
As your application (with these dimensions) doesn't need a lot of computing time, it is possible that you will be slower with this method. Also, you will create multiple copies of all your variables, which may exhaust your memory.
You should also double-check the func I wrote, as I didn't completely verify that it does what it is supposed to do :)
If you want to apply a single function over an array of data, then using e.g. a multiprocessing.Pool is a good solution, provided that both the input and output of the calculation are relatively small.
You want to do many different calculations to two input arrays, which results in an array being returned for every one of those calculations.
Since separate processes do not share memory, the X and Y arrays have to be transported to each worker process when it is started. And the result of each calculation (which is also a NumPy array the same size as X and Y) has to be returned to the parent process.
Depending on e.g. the size of the arrays and the number of cores, the overhead from the transfer of all those arrays between worker processes and the parent process via interprocess communication ("IPC") will cost time, reducing the advantages of using multiple cores.
Keep in mind that the parent process has to listen for and handle IPC requests from all the worker processes. So you've shifted the bottleneck from calculation to communication.
So it is not a given that multiprocessing will actually improve performance in this case. It depends on the details of the actual problem (number of cores, array size, amount of physical memory et cetera).
You will have to do some careful performance measurements using e.g. Pool or Process with realistic array sizes.
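As a starting point for such measurements, the sketch below shows one possible way to hand each independent slice to a worker with multiprocessing.Pool. It assumes the polynomial slices from the question (slice i is the sum of X**p + Y**p for p from 1 to i); compute_slice and build_array are hypothetical names, and whether this beats the plain loop depends on the array sizes, as discussed above.

```python
from multiprocessing import Pool

import numpy as np

def compute_slice(args):
    """Worker: compute slice i from copies of X and Y sent over IPC."""
    i, X, Y = args
    out = np.zeros_like(X, dtype=float)
    for p in range(1, i + 1):
        out += X ** p + Y ** p
    return i, out  # return the index so the parent knows where to store it

def build_array(X, Y, size=8, processes=4):
    """Parent: distribute the slices and collect the results."""
    array = np.zeros(X.shape + (size,))
    tasks = [(i, X, Y) for i in range(size)]
    with Pool(processes) as pool:
        # Results may arrive in any order, hence the returned index.
        for i, values in pool.imap_unordered(compute_slice, tasks):
            array[..., i] = values
    return array
```

On platforms that start workers by spawning a fresh interpreter (Windows, and macOS by default), build_array should be called from inside an `if __name__ == '__main__':` block. Timing it against a single-process loop with realistic array sizes shows whether the IPC cost is worth paying.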
Three things:
The most important question is why are you doing this?.
Your NumPy build may already be making use of multiple cores. I am not sure off the top of my head how to check, see questions like this or if absolutely necessary take a look at the Numexpr library
About the "Y" in your likely XY problem - you are re-calculating data that you can instead re-use:
import numpy
size = 8
array = numpy.zeros((Y.shape[0], X.shape[0], size))
array[..., 0] = 0
for i in range(1, size):
    # each slice is the previous slice plus the next power of X and Y
    array[..., i] = X ** i + Y ** i + array[..., i - 1]

Memory utilization in recursive vs an iterative graph traversal

I have looked at some common tools like Heapy to measure how much memory is being utilized by each traversal technique but I don't know if they are giving me the right results. Here is some code to give the context.
The code simply counts the number of unique nodes in a graph. Two traversal techniques are provided, viz. count_bfs and count_dfs.
import sys
from guppy import hpy

class Graph:
    def __init__(self, key):
        self.key = key  # unique id for a vertex
        self.connections = []
        self.visited = False

def count_bfs(start):
    parents = [start]
    children = []
    count = 0
    while parents:
        for ind in parents:
            if not ind.visited:
                count += 1
                ind.visited = True
                for child in ind.connections:
                    children.append(child)
        parents = children
        children = []
    return count

def count_dfs(start):
    if not start.visited:
        start.visited = True
    else:
        return 0
    n = 1
    for connection in start.connections:
        n += count_dfs(connection)
    return n

def construct(file, s=1):
    """Constructs a Graph using the adjacency matrix given in the file
    :param file: path to the file with the matrix
    :param s: starting node key. Defaults to 1
    :return start vertex of the graph
    """
    d = {}
    f = open(file, 'r')
    size = int(f.readline())
    for x in range(1, size+1):
        d[x] = Graph(x)
    start = d[s]
    for i in range(0, size):
        l = [int(x) for x in f.readline().split()]
        node = l[0]
        for child in l[1:]:
            d[node].connections.append(d[child])
    return start

if __name__ == "__main__":
    s = construct(sys.argv[1])
    #h = hpy()
    print(count_bfs(s))
    #print(h.heap())
    s = construct(sys.argv[1])
    #h = hpy()
    print(count_dfs(s))
    #print(h.heap())
I want to know by what factor is the total memory utilization different in the two traversal techniques viz. count_dfs and count_bfs? One might have the intuition that dfs may be expensive as a new stack is created for every function call. How can the total memory allocations in each traversal technique be measured?
Do the (commented) hpy statements give the desired measure?
Sample file with connections:
1 2 3
2 1 3
3 4
This being a Python question, it may be more important how much stack space is used than how much total memory. CPython has a low default limit of 1000 frames because it shares its call stack with the C call stack, which in turn is limited to the order of one megabyte in most places. For this reason you should almost* always prefer iterative solutions to recursive ones when the recursion depth is unbounded.
* other implementations of python may not have this restriction. The stackless variants of cpython and pypy have this exact property
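As an illustration of the iterative alternative, the sketch below counts reachable nodes with an explicit stack instead of recursion, so the depth of the graph is limited only by available memory and not by the interpreter's frame limit. It uses a plain adjacency dict rather than the Graph class from the question, and count_reachable is a made-up name.

```python
def count_reachable(graph, start):
    """Count nodes reachable from start using an explicit stack (iterative DFS).

    graph maps each node key to a list of neighbour keys (assumed format).
    """
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)
```

A chain tens of thousands of nodes long, which would exceed the default recursion limit in a recursive version, is handled without any change to interpreter settings.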
For your specific problem, I don't know if there's going to be an easy solution. That's because, the peak memory usage of a graph traversal depends on the details of the graph itself.
For a depth-first traversal, the greatest usage will come when the algorithm has gone to the deepest depth. In your example graph, it will traverse 1->2->3->4, and create a stack frame for each level. So while it is at 4 it has allocated the most memory.
For the breadth-first traversal, the memory used will be proportional to the number of nodes at each depth plus the number of child nodes at the next depth. Those values are stored in lists, which are probably more efficient than stack frames. In the example, since the first node is connected to all the others, it happens immediately during the first step [1]->[2,3,4].
I'm sure there are some graphs that will do much better with one search or the other.
For example, imagine a graph that looks like a linked list, with all the vertices in a single long chain. The depth-first traversal will have a very high peak memory usage, since it will recurse all the way down the chain, allocating a stack frame for each level. The breadth-first traversal will use much less memory, since it will only have a single vertex with a single child to keep track of on each step.
Now, contrast that with a graph that is a depth-2 tree. That is, there's a single root element that is connected to a great many children, none of which are connected to each other. The depth-first traversal will not use much memory at any given time, as it will only need to traverse two nodes before it has to back up and try another branch. The breadth-first traversal, on the other hand, will be putting all of the child nodes in memory at once, which for a big tree could be problematic.
Your current profiling code won't find the peak memory usage you want, because it only finds the memory used by objects on the heap at the time you call heap. That's likely to be the same before and after your traversals. Instead, you'll need to insert profiling code into the traversal functions themselves. I can't find a pre-built package of guppy to try it myself, but I think this untested code will work:
from guppy import hpy

def count_bfs(start):
    hp = hpy()
    base_mem = hp.heap().size
    max_mem = 0
    parents = [start]
    children = []
    count = 0
    while parents:
        for ind in parents:
            if not ind.visited:
                count += 1
                ind.visited = True
                for child in ind.connections:
                    children.append(child)
            # sample the heap while the parents/children lists are populated
            mem = hp.heap().size - base_mem
            if mem > max_mem:
                max_mem = mem
        parents = children
        children = []
    return count, max_mem

def count_dfs(start, hp=hpy(), base_mem=None):
    if base_mem is None:
        base_mem = hp.heap().size
    if not start.visited:
        start.visited = True
    else:
        return 0, hp.heap().size - base_mem
    n = 1
    # sample at this depth too, so leaf nodes are measured
    max_mem = hp.heap().size - base_mem
    for connection in start.connections:
        c, mem = count_dfs(connection, hp, base_mem)
        if mem > max_mem:
            max_mem = mem
        n += c
    return n, max_mem
Both traversal functions now return a (count, max-memory-used) tuple. You can try them out on a variety of graphs to see what the differences are.
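If guppy is not available, the standard library's tracemalloc module offers another way to observe peak allocations during a traversal. This is a different tool from the one used above, and the sketch makes its own assumptions: the Node class, make_chain, make_star and peak_bytes are illustrative names, and the numbers only cover memory that tracemalloc traces, so they should be read as a relative comparison and not as exact totals.

```python
import tracemalloc

class Node:
    def __init__(self, key):
        self.key = key
        self.connections = []
        self.visited = False

def count_bfs(start):
    """Level-by-level traversal, same structure as the question's count_bfs."""
    parents, count = [start], 0
    while parents:
        children = []
        for node in parents:
            if not node.visited:
                node.visited = True
                count += 1
                children.extend(node.connections)
        parents = children
    return count

def make_chain(n):
    """n nodes linked one after another (linked-list shape)."""
    nodes = [Node(i) for i in range(n)]
    for a, b in zip(nodes, nodes[1:]):
        a.connections.append(b)
    return nodes[0]

def make_star(n):
    """One root connected directly to n - 1 children (depth-2 tree)."""
    root = Node(0)
    root.connections = [Node(i) for i in range(1, n)]
    return root

def peak_bytes(func, *args):
    """Run func and report the peak traced allocation during the call."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Because the graphs are built before tracing starts, only allocations made by the traversal are counted. Calling peak_bytes(count_bfs, make_star(2000)) and peak_bytes(count_bfs, make_chain(2000)) should show the star costing the breadth-first traversal noticeably more than the chain, in line with the reasoning above.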
It's tough to measure exactly how much memory is being used because systems vary in how they implement stack frames. Generally speaking, recursive algorithms use far more memory than iterative algorithms because each stack frame must store the state of its variables whenever a new function call occurs. Consider the difference between dynamic programming solutions and recursive solutions. Runtime is far faster on an iterative implementation of an algorithm than a recursive one.
If you really must know how much memory your code uses, load your software in a debugger such as OllyDbg and count the bytes. Happy coding!
Of the two, depth-first uses less memory if most traversals end up hitting most of the graph.
Breadth-first can be better when the target is near the starting node, or when the number of nodes doesn't go up very quickly so the parents/children arrays in your code stay small (e.g. another answer mentioned linked list as worst-case for DFS).
If the graph you're searching is spatial data, or has what's known as an "admissible heuristic," A* is another algorithm that's pretty good:
However, premature optimization is the root of all evil. Look at the actual data you want to use; if it fits in a reasonable amount of memory, and the search runs in a reasonable time, it doesn't matter which algorithm you use. NB, what's "reasonable" depends on the application you're using it for and the amount of resources on the hardware that will be running it.
For either search order implemented iteratively with the standard data structure describing it (queue for BFS, stack for DFS), I can construct a graph that uses O(n) memory trivially. For BFS, it's an n-star, and for DFS it's an n-chain. I don't believe either of them can be implemented for the general case to do better than that, so that also gives an Omega(n) lower bound on maximum memory usage. So, with efficient implementations of each, it should generally be a wash.
Now, if your input graphs have some characteristics that bias them more toward one of those extremes or the other, that might inform your decision on which to use in practice.
