Are binary partition trees premade before points are added to nodes? - python

I'm trying to implement this algorithm in Python, but due to my lack of understanding of tree structures I'm confused about the creation process of the partition tree.
Brief Explanation:
The linked algorithm partitions a high-dimensional feature space into internal and leaf nodes so that queries can be performed quickly.
It divides the large space using a specific random test: a hyperplane that splits one large cell into two.
This answer explains everything much more precisely.
[figure taken from the link above]
Code Fragments:
def random_test(self, main_point):  # main_point is an np.ndarray instance
    dimension = main_point.ravel().size
    random_coefficients = self.random_coefficients(dimension)
    scale_values = np.array(sorted([np.inner(random_coefficients, point.ravel())
                                    for point in self.points]))
    percentile = random.choice([np.percentile(scale_values, 100 * self.ratio),  # Just as described in Section 3.1
                                np.percentile(scale_values, 100 * (1 - self.ratio))])
    main_term = np.inner(main_point.ravel(), random_coefficients)
    if self.is_leaf():
        return 0  # Next node is the center leaf child
    else:
        if (main_term - percentile) >= 0:  # Hyper-plane equation defined in the document
            return -1  # Next node is the left child
        else:
            return 1  # Next node is the right child
self.ratio, as mentioned in the linked algorithm, determines how balanced and shallow the tree will be; at 1/2 it is supposed to generate the most balanced and shallow tree.
Then we move on to the iterative part, where the tree keeps dividing the space further and further until it reaches a leaf node (notice the keyword reaches). The problem is that it never truly reaches a leaf node.
The definition of a leaf node in the linked document is this:
def is_leaf(self):
    return (self.capacity * self.ratio) <= self.cell_count() <= self.capacity
where self.cell_count() is the number of points in the cell, self.capacity is the maximum number of points the cell can hold, and self.ratio is the split ratio.
My full code should basically divide the feature space by creating new nodes on each iteration until a leaf node is created (but a leaf node is never created). See the fragment above that contains the division process.
[figure taken from the document linked above]
tl;dr:
Are binary partition trees prepared (filled with empty nodes) before any points are added to them? If so, don't we need to define the depth of the tree in advance?
If not, are binary partition trees created while points are added to them? If so, how is the first point (from the first iteration) added to the tree?

They are built as you go along. The first node is right or left of line 1. Then the next level asks right or left of line 2... Your illustration from the provided paper shows this, with the lines numbered in association with the choice presented for finding the node.
Of course, "right or left" is not strictly accurate: some lines cut horizontally. But the spaces created are binary.

I've been able to test the new method mentioned in the comments, and it works perfectly fine.
The algorithm linked above implicitly states that each point shall be individually dropped down the partition tree, passing all the random tests and creating new nodes as it descends.
But there is a significant problem with this method: in order to have a balanced, efficient and shallow tree, left and right nodes must be distributed evenly.
Hence, to split a node, at every level of the tree every point of the node must be passed to either the left or the right child (by a random test), until the tree reaches the depth where all nodes at that level are leaves.
In mathematical terms, the root node contains a vector space that the separating hyper-plane divides into left and right nodes containing convex polyhedra bounded by supporting hyper-planes. In the notation of random_test above, the separating hyper-plane is ⟨c, x⟩ − p = 0, where c is the vector of random coefficients and p is the percentile term.
The negative term of the equation (I believe we can call it the bias), p, is where the splitting ratio comes into play: it should be a percentile of all the node's points between 100·r and 100·(1−r), where r is the split ratio, so that the tree is separated more evenly and stays shallow. Basically, it decides how even the hyper-plane separation should be; that's why we need nodes that contain all the points.
I have been able to implement such a system:
def index_space(self):
    shuffled_space = self.shuffle_space()
    current_tree = PartitionTree()
    level = 0
    root_node = RootNode(shuffled_space, self.capacity, self.split_ratio, self.indices)
    current_tree.root_node = root_node
    current_tree.node_array.append(root_node)
    node_array = {0: [root_node]}
    while True:
        current_nodes = node_array[level]
        if all([node.is_leaf() for node in current_nodes]):
            break
        else:
            level += 1
            node_array[level] = []
            for current_node in current_nodes:
                if not current_node.is_leaf():
                    # Each child extends its own parent's position path;
                    # reusing a single position taken from the root would
                    # give every child at a level the same prefix
                    left_child = InternalNode(self.capacity, self.split_ratio, self.indices,
                                              self._append(current_node.node_position, [-1]), current_node)
                    right_child = InternalNode(self.capacity, self.split_ratio, self.indices,
                                               self._append(current_node.node_position, [1]), current_node)
                    for point in current_node.list_points():
                        if current_node.random_test(point) == 1:
                            right_child.add_point(point)
                        else:
                            left_child.add_point(point)
                    node_array[level].extend([left_child, right_child])
where node_array contains all the nodes of the tree (root, internal and leaf).
Unfortunately, the node.random_test(x) method:
def random_test(self, main_point):
    random_coefficients = self.random_coefficients()
    # Reuse the same coefficient vector for every point; calling
    # self.random_coefficients() again per point would draw a fresh
    # random vector each time
    scale_values = [np.inner(random_coefficients, point[:self.indices].ravel())
                    for point in self.points]
    percentile = np.percentile(scale_values, self.ratio * 100)
    main_term = np.inner(main_point[:self.indices].ravel(), random_coefficients)
    if self.is_leaf():
        return 0  # Next node is the center leaf child
    else:
        if (main_term - percentile) >= 0:  # Hyper-plane equation defined in the document
            return -1  # Next node is the left child
        else:
            return 1  # Next node is the right child
is inefficient, since calculating the percentile takes too much time. Hence I have to find another way to calculate the percentile (perhaps a short-circuited binary search, or a linear-time selection).
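One possible linear-time route (a sketch, not part of the original solution): np.partition performs an introselect, so a single split threshold can be pulled out of the projections without fully sorting them. Here projections stands for the scale_values list built in random_test; note this returns the nearest order statistic rather than np.percentile's interpolated value:
import numpy as np

def split_threshold(projections, ratio):
    # Index of the desired quantile among the sorted projections
    k = int(ratio * (len(projections) - 1))
    # np.partition places the k-th smallest value at index k in O(n)
    # average time, without sorting the rest of the array
    return np.partition(np.asarray(projections), k)[k]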
Conclusion:
This is just a large extension of Clinton Ray Mulligan's answer, which briefly explains the solution for creating such trees and hence will remain the accepted answer.
I have just added more details in case anyone is interested in implementing randomized binary partition trees.

Related

How to calculate the number of nodes in an unbalanced binary tree at a given depth

I am trying to calculate the number of nodes a tree will have at a given depth if the binary tree is not balanced.
I know that in the case of a perfectly balanced tree, you can use 2^d to calculate the number of nodes at depth d, where the root is at depth 0.
Assume there is a binary tree whose root has only one child instead of two. Then at depth 1 there is only one node instead of two, which means at depth 2 there can be at most two nodes instead of four, at depth 3 at most four instead of eight, and so on.
So, is there any way I can foretell the number of nodes there will be at a given depth based on the number of nodes present or not present at the previous depth?
Any kind of answer would do if there is a mathematical formula that will help. If you know a way I could do it iteratively in breadth-first search order in any programming language, that would help too.
If you know the number of nodes at depth 𝑑 is 𝑛, then the number of nodes at depth 𝑑 + 1 lies between 0 and 2𝑛. The minimum of 0 is reached when all the nodes at depth 𝑑 happen to be leaves, and the maximum of 2𝑛 is reached when all the nodes at depth 𝑑 happen to have two children each.
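If you want the actual counts rather than just bounds, a breadth-first sweep produces them directly. A minimal sketch, assuming a hypothetical Node class with left/right children (the question doesn't fix a tree representation):
from collections import deque

class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def nodes_per_depth(root):
    counts = []
    level = deque([root] if root else [])
    while level:
        counts.append(len(level))  # number of nodes at the current depth
        # Collect the next depth's nodes from the current level's children
        level = deque(child for node in level
                      for child in (node.left, node.right) if child)
    return counts

# The example from the question: a root with one child, which has two children
print(nodes_per_depth(Node(left=Node(left=Node(), right=Node()))))  # [1, 1, 2]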

Shortest path between two nodes with fixed number of nodes in path

I have a weighted graph with around 800 nodes, each with a number of connections ranging from 1 to around 300. I need to find the shortest (lowest cost) path between two nodes with some extra criteria:
The path must contain exactly five nodes.
Each node has an attribute (called position in the example code) that takes one of five values; the five nodes in the path must all have unique values for this attribute.
The algorithm needs to allow for 1-2 required nodes to be specified that the path must contain at some point in any order.
The algorithm needs to take less than 10 seconds to run, preferably as little time as possible while losing as little accuracy as possible.
My current solution in Python is to run a Depth-Limited Depth-First Search which recursively searches every possible path. To make this algorithm run in reasonable time I have introduced a limit to the number of neighbour nodes that are searched at each recursion level. This number can be lowered to decrease the computation time but at the cost of accuracy. Currently this algorithm is far too slow, with my most recent test coming in at 75 seconds with a neighbour limit of 30. If I decrease this neighbour limit any more, my testing shows that the accuracy of the algorithm begins to suffer badly. I am out of ideas on how to solve this problem while satisfying all of the above criteria. My code is as follows:
# The path must go from start -> end, be of length 5 and contain all nodes in middle
# Each node is represented as a tuple: (value, position)
# graph, depth and neighbor_limit are module-level globals in this setup
import math

def dfs(start, end, middle=[], path=Path(), best=Path([], math.inf)):
    # If this is the first level of recursion, initialise the path variable
    if len(path) == 0:
        path = Path([start])
    # If the max depth has been exceeded, check if the current node is the goal node
    if len(path) >= depth:
        # If it is, save the path
        # Check that all required nodes have been visited
        if len(path) == depth and start == end and path.cost < best.cost and all(x in path.path for x in middle):
            # Store the new best path
            best.path = path.path
            best.cost = path.cost
        return
    # Run DFS on all of the neighbors of the node that haven't been searched already
    # Use the weights of the neighbors as a heuristic; sort by lowest weight first
    neighbors = sorted([x for x in graph.get(*start).connected_nodes], key=lambda x: graph.weight(start, x))
    # Make sure that all neighbors haven't been visited yet and that their positions aren't already accounted for
    positions = [x[1] for x in path.path]
    # Only visit neighbouring nodes with unique positions and ids
    filtered = [x for x in neighbors if x not in path.path and x[1] not in positions]
    for neighbor in filtered[:neighbor_limit]:
        if neighbor not in path.path:
            dfs(neighbor, end, middle, Path(path.path + [neighbor], path.cost + graph.weight(start, neighbor)), best)
    return best
Path Class:
class Path:
    def __init__(self, path=[], cost=0):
        self.path = path
        self.cost = cost

    def __len__(self):
        return len(self.path)
Any help in improving this algorithm or even suggestions on a better approach to the problem would be much appreciated, thanks in advance!
You should iterate over all possible orderings of the 'position' attribute, and for each one use Dijkstra's algorithm or BFS to find the shortest path that respects that ordering.
Since you know the position of the first and last nodes, there are only 3! = 6 different orderings for the intermediate nodes, so you only have to run Dijkstra's algorithm 6 times.
Even in python, this shouldn't take more than a couple hundred milliseconds to run, based on the input sizes you provided.
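A sketch of that idea under assumed representations (graph[u] as a dict of neighbour -> weight, pos[u] as the node's position attribute; neither is the asker's real API): since all five positions in a valid path are distinct, fixing an ordering turns the search into a shortest-path DP over five layers, and a required node simply reserves the layer matching its position. It assumes start and end occupy different positions, as the question implies:
import itertools
import math

def best_five_node_path(graph, pos, start, end, required=()):
    # Each required node must occupy the layer matching its position
    required_by_pos = {pos[r]: r for r in required}
    middle = [p for p in range(5) if p not in (pos[start], pos[end])]
    best_cost, best_path = math.inf, None
    for order in itertools.permutations(middle):  # 3! = 6 orderings
        dp = {start: (0.0, [start])}  # node -> (cost so far, path)
        for layer_pos in list(order) + [pos[end]]:
            nxt = {}
            for u, (cost, path) in dp.items():
                for v, w in graph[u].items():
                    if pos[v] != layer_pos:
                        continue
                    if layer_pos in required_by_pos and v != required_by_pos[layer_pos]:
                        continue  # layer is reserved for a required node
                    if cost + w < nxt.get(v, (math.inf, None))[0]:
                        nxt[v] = (cost + w, path + [v])
            dp = nxt  # keep only the cheapest way to reach each node per layer
        if end in dp and dp[end][0] < best_cost:
            best_cost, best_path = dp[end]
    return best_path, best_cost

Keeping only the cheapest route to each node per layer is exactly the Dijkstra/BFS run described above, executed once per ordering, so the total work is roughly six layered passes over the edges.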

Analyzing BST is balanced Code

I was looking at this Python code that checks whether a binary search tree is balanced, and I was wondering if someone could explain how
height = max(left_height, right_height) + 1 comes into play. How does this line of code calculate the height? And finally, how does the
is_left_balanced, left_height = TreeNode.is_balanced(cur_node.left)
is_right_balanced, right_height = TreeNode.is_balanced(cur_node.right)
stack recursion work, given that once is_balanced(cur_node.left) is called, is_balanced is called again for cur_node.right?
I'm having trouble following the stack trace for this.
Here is the entire code:
def is_balanced(cur_node):
    if (not cur_node):
        height = 0
        return True, height
    is_left_balanced, left_height = TreeNode.is_balanced(cur_node.left)
    is_right_balanced, right_height = TreeNode.is_balanced(cur_node.right)
    # To get the height of the current node, we find the maximum of the
    # left subtree height and the right subtree height and add 1 to it
    height = max(left_height, right_height) + 1
    if (not is_left_balanced or not is_right_balanced):
        return False, height
    # If the difference between the height of the left subtree and the height
    # of the right subtree is more than 1, then the tree is unbalanced
    if (abs(left_height - right_height) > 1):
        return False, height
    return True, height
When dealing with recursion problems, you should always pay attention to what your base cases are. In this program there is only one base case: an empty tree, which has a trivially obtainable height of 0 and is, by definition, balanced.
Now, a tree is unbalanced if its taller half has a height of more than 1 plus the height of its other half (that is, its taller half is taller by at least two levels), or if one of its subtrees is unbalanced. That gives us a recursive way of calculating it.
First, return 0 and True for the trivially balanced (empty) tree. That's our base case.
Second, if either of the halves is unbalanced, then the tree is unbalanced. This is checked for each subtree, so the recursion keeps going until it finds an empty tree, where it starts returning (picture the case of a tree with only one level, one node; imagine how the code runs and you'll probably be able to extrapolate from there).
Finally, if each subtree is balanced (that's the third case, since we already got that far into the program), then the only way the tree can be unbalanced is if one of its subtrees is taller than the other by more than one level. We just check that and return accordingly.
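For a concrete trace, here is a minimal, hypothetical TreeNode (the question doesn't show the full class) run on a three-node left chain, the smallest unbalanced tree:
class TreeNode:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

    @staticmethod
    def is_balanced(cur_node):
        if not cur_node:
            return True, 0  # base case: the empty tree
        is_left, left_h = TreeNode.is_balanced(cur_node.left)
        is_right, right_h = TreeNode.is_balanced(cur_node.right)
        height = max(left_h, right_h) + 1  # one level above the taller child
        if not is_left or not is_right or abs(left_h - right_h) > 1:
            return False, height
        return True, height

b = TreeNode()          # is_balanced(b) -> (True, 1): both children are empty
a = TreeNode(left=b)    # is_balanced(a) -> (True, 2): child heights 1 and 0
root = TreeNode(left=a)
print(TreeNode.is_balanced(root))  # (False, 3): child heights 2 and 0 differ by 2

Each call fully resolves its left subtree before its right one, so the deepest left call (b's empty children) returns first, and the heights bubble back up one level at a time.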
I hope this helped you understand, feel free to ask me any other questions otherwise!

Memory utilization in recursive vs an iterative graph traversal

I have looked at some common tools like Heapy to measure how much memory each traversal technique uses, but I don't know if they are giving me the right results. Here is some code to give the context.
The code simply counts the number of unique nodes in a graph, with two traversal techniques provided: count_bfs and count_dfs.
import sys
from guppy import hpy

class Graph:
    def __init__(self, key):
        self.key = key  # unique id for a vertex
        self.connections = []
        self.visited = False

def count_bfs(start):
    parents = [start]
    children = []
    count = 0
    while parents:
        for ind in parents:
            if not ind.visited:
                count += 1
                ind.visited = True
                for child in ind.connections:
                    children.append(child)
        parents = children
        children = []
    return count

def count_dfs(start):
    if not start.visited:
        start.visited = True
    else:
        return 0
    n = 1
    for connection in start.connections:
        n += count_dfs(connection)
    return n

def construct(file, s=1):
    """Constructs a Graph using the adjacency matrix given in the file

    :param file: path to the file with the matrix
    :param s: starting node key. Defaults to 1
    :return: start vertex of the graph
    """
    d = {}
    f = open(file, 'rU')
    size = int(f.readline())
    for x in xrange(1, size + 1):
        d[x] = Graph(x)
    start = d[s]
    for i in xrange(0, size):
        l = map(lambda x: int(x), f.readline().split())
        node = l[0]
        for child in l[1:]:
            d[node].connections.append(d[child])
    return start

if __name__ == "__main__":
    s = construct(sys.argv[1])
    #h = hpy()
    print(count_bfs(s))
    #print h.heap()
    s = construct(sys.argv[1])
    #h = hpy()
    print(count_dfs(s))
    #print h.heap()
I want to know by what factor the total memory utilization differs between the two traversal techniques, count_dfs and count_bfs. One might have the intuition that DFS is more expensive, since a new stack frame is created for every function call. How can the total memory allocation of each traversal technique be measured?
Do the (commented-out) hpy statements give the desired measure?
Sample file with connections:
4
1 2 3
2 1 3
3 4
4
This being a Python question, how much stack space is used may matter more than how much total memory. CPython has a low limit of 1000 frames because it shares its call stack with the C call stack, which in turn is limited to the order of one megabyte in most environments. For this reason you should almost* always prefer iterative solutions to recursive ones when the recursion depth is unbounded.
* Other implementations of Python may not have this restriction; the stackless variants of CPython and PyPy, in particular, do not.
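As a concrete illustration of that advice, here is a sketch of the question's count_dfs rewritten with an explicit stack (same Graph fields, visited and connections), which moves the bookkeeping from the limited C call stack onto a heap-allocated list:
def count_dfs_iterative(start):
    stack = [start]
    count = 0
    while stack:
        node = stack.pop()
        if node.visited:
            continue
        node.visited = True
        count += 1
        # Children wait on an ordinary Python list instead of the call stack,
        # so depth is bounded by available memory, not the recursion limit
        stack.extend(node.connections)
    return count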
For your specific problem, I don't know if there's going to be an easy answer, because the peak memory usage of a graph traversal depends on the details of the graph itself.
For a depth-first traversal, the greatest usage will come when the algorithm has gone to the deepest depth. In your example graph, it will traverse 1->2->3->4, and create a stack frame for each level. So while it is at 4 it has allocated the most memory.
For the breadth-first traversal, the memory used will be proportional to the number of nodes at each depth plus the number of child nodes at the next depth. Those values are stored in lists, which are probably more efficient than stack frames. In the example, since the first node is connected to all the others, it happens immediately during the first step [1]->[2,3,4].
I'm sure there are some graphs that will do much better with one search or the other.
For example, imagine a graph that looks like a linked list, with all the vertices in a single long chain. The depth-first traversal will have a very high peak memory usage, since it will recurse all the way down the chain, allocating a stack frame for each level. The breadth-first traversal will use much less memory, since it only has a single vertex with a single child to keep track of at each step.
Now, contrast that with a graph that is a depth-2 tree: a single root element connected to a great many children, none of which are connected to each other. The depth-first traversal will not use much memory at any given time, as it only needs to traverse two nodes before backing up and trying another branch. The breadth-first traversal, on the other hand, will put all of the child nodes in memory at once, which for a big tree could be problematic.
Your current profiling code won't find the peak memory usage you want, because it only finds the memory used by objects on the heap at the time you call heap. That's likely to be the same before and after your traversals. Instead, you'll need to insert profiling code into the traversal functions themselves. I can't find a pre-built package of guppy to try it myself, but I think this untested code will work:
from guppy import hpy

def count_bfs(start):
    hp = hpy()
    base_mem = hp.heap().size  # hp is the instance; hpy itself is the factory
    max_mem = 0
    parents = [start]
    children = []
    count = 0
    while parents:
        for ind in parents:
            if not ind.visited:
                count += 1
                ind.visited = True
                for child in ind.connections:
                    children.append(child)
        mem = hp.heap().size - base_mem
        if mem > max_mem:
            max_mem = mem
        parents = children
        children = []
    return count, max_mem

def count_dfs(start, hp=hpy(), base_mem=None):
    if base_mem is None:
        base_mem = hp.heap().size
    if not start.visited:
        start.visited = True
    else:
        return 0, hp.heap().size - base_mem
    n = 1
    max_mem = 0
    for connection in start.connections:
        # Pass hp and base_mem through explicitly so every level
        # measures against the same baseline
        c, mem = count_dfs(connection, hp, base_mem)
        if mem > max_mem:
            max_mem = mem
        n += c
    return n, max_mem
Both traversal functions now return a (count, max-memory-used) tuple. You can try them out on a variety of graphs to see what the differences are.
It's tough to measure exactly how much memory is being used because systems vary in how they implement stack frames. Generally speaking, recursive algorithms use far more memory than iterative algorithms because each stack frame must store the state of its variables for every function call. Consider the difference between dynamic programming solutions and recursive solutions: runtime is typically far faster for an iterative implementation of an algorithm than a recursive one.
If you really must know how much memory your code uses, load your software in a debugger such as OllyDbg (http://www.ollydbg.de/) and count the bytes. Happy coding!
Of the two, depth-first uses less memory if most traversals end up hitting most of the graph.
Breadth-first can be better when the target is near the starting node, or when the number of nodes doesn't go up very quickly so the parents/children arrays in your code stay small (e.g. another answer mentioned linked list as worst-case for DFS).
If the graph you're searching is spatial data, or has what's known as an "admissible heuristic," A* is another algorithm that's pretty good: http://en.wikipedia.org/wiki/A_star
However, premature optimization is the root of all evil. Look at the actual data you want to use; if it fits in a reasonable amount of memory, and the search runs in a reasonable time, it doesn't matter which algorithm you use. NB, what's "reasonable" depends on the application you're using it for and the amount of resources on the hardware that will be running it.
For either search order implemented iteratively with the standard data structure describing it (queue for BFS, stack for DFS), I can construct a graph that uses O(n) memory trivially. For BFS, it's an n-star, and for DFS it's an n-chain. I don't believe either of them can be implemented for the general case to do better than that, so that also gives an Omega(n) lower bound on maximum memory usage. So, with efficient implementations of each, it should generally be a wash.
Now, if your input graphs have some characteristics that bias them more toward one of those extremes or the other, that might inform your decision on which to use in practice.
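A quick, self-contained way to see the two extremes from these answers: track the BFS frontier size and the recursive DFS depth (rough proxies for the peak memory of the question's two functions) on an n-chain and an n-star. All names here are illustrative, not from the question:
from collections import deque

def bfs_peak_queue(adj, start):
    queue, visited, peak = deque([start]), {start}, 1
    while queue:
        peak = max(peak, len(queue))
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return peak

def dfs_peak_depth(adj, node, visited, depth=1):
    visited.add(node)
    peak = depth
    for nxt in adj[node]:
        if nxt not in visited:
            peak = max(peak, dfs_peak_depth(adj, nxt, visited, depth + 1))
    return peak

n = 500  # kept below CPython's default recursion limit of 1000
chain = {i: ([i + 1] if i < n - 1 else []) for i in range(n)}
star = {0: list(range(1, n)), **{i: [] for i in range(1, n)}}
print(bfs_peak_queue(chain, 0), dfs_peak_depth(chain, 0, set()))  # 1 500
print(bfs_peak_queue(star, 0), dfs_peak_depth(star, 0, set()))    # 499 2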

a* search. adding movement cost from initial position

if f = g + h then where in the below code would I add g?
Also, besides adding the movement cost from my initial position, how else can I make this code more efficient?
def a_star(initial_node):
    open_set, closed_set = dict(), list()
    open_set[initial_node] = heuristic(initial_node)
    while open_set:
        onode = get_next_best_node(open_set)
        if onode == GOAL_STATE:
            return reconstruct_path(onode)
        del open_set[onode]
        closed_set.append(onode)
        for snode in get_successor_nodes(onode):
            if snode in closed_set:
                continue
            if snode not in open_set:
                open_set[snode] = heuristic(snode)
                self.node_rel[snode] = onode
    return False
In the last if: if snode is not in open_set (no pun intended!), you shouldn't store just the heuristic, but the heuristic plus the actual cost of reaching the current node. And if snode is already in the open set, you should keep the minimum of the existing value and the new one (if there are two or more ways to reach the same node, the least costly one should be considered).
That means you need to store both the node's "actual" cost and its "estimated" cost. The actual cost of the initial node is zero. For every new node, it is the minimum, over every incoming arc, of the actual cost of the other vertex plus the cost of the arc (in other words, the cost of the previous node plus the cost to move from it to the current one). The estimated cost then sums both values: the actual cost so far plus the heuristic function.
I don't know how the nodes are represented in your code, so I can't give advice more specific than that. If you still have doubt, please edit your question providing more details.
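That said, here is a hedged sketch of the shape the fix could take: g holds each node's actual cost, the open set is a heap keyed on f = g + h, and a cheaper route to an already-seen node overwrites the old one. heuristic, get_successor_nodes, GOAL_STATE and reconstruct_path mirror the question's names, while move_cost(a, b) and the explicit parent map are assumptions standing in for the unshown cost function and self.node_rel:
import heapq
import itertools

def a_star(initial_node):
    g = {initial_node: 0}  # actual cost from the start node
    parent = {}
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    open_heap = [(heuristic(initial_node), next(tie), initial_node)]  # (f, tie, node)
    closed_set = set()
    while open_heap:
        f, _, onode = heapq.heappop(open_heap)
        if onode == GOAL_STATE:
            return reconstruct_path(onode, parent)
        if onode in closed_set:
            continue  # a stale, more expensive entry for this node
        closed_set.add(onode)
        for snode in get_successor_nodes(onode):
            tentative_g = g[onode] + move_cost(onode, snode)
            if snode not in g or tentative_g < g[snode]:
                g[snode] = tentative_g  # keep the cheaper of the two routes
                parent[snode] = onode
                heapq.heappush(open_heap, (tentative_g + heuristic(snode), next(tie), snode))
    return False

The heap also replaces get_next_best_node (popping the smallest f is the "next best node"), which addresses the efficiency part of the question: a per-iteration scan of the open set becomes a logarithmic heap operation.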
