Too slow queries in interval tree

Too slow queries in interval tree - python

I have a list of intervals and I need to return the ones that overlap with an interval passed in a query. What is special is that in a typical query around a third or even half of the intervals will overlap with the one given in the query. Also, the ratio of the shortest interval to the longest is not more than 1:5. I implemented my own interval tree (augmented red-black tree) - I did not want to use existing implementations because I needed support for closed intervals and some special features. I tested the query speed with 6000 queries in a tree with 6000 intervals (so n=6000 and m=3000 (app.)). It turned out that brute force is just as good as using the tree:
Computation time - loop: 125.220461 s
Tree setup: 0.05064 s
Tree Queries: 123.167337 s
Let me use asymptotic analysis. n: number of queries; n: number of intervals; app. n/2: number of intervals returned in a query:
time complexity brute force: n*n
time complexity tree: n*(log(n)+n/2) --> 1/2 nn + nlog(n) --> n*n
So the result is saying that the two should be roughly the same for a large n. Still one would somehow expect the tree to be noticeably faster given the constant 1/2 in front of n*n. So there are three possible reasons I can imagine for the results I got:
a) My implementation is wrong. (Should I be using BFS like below?)
b) My implementation is right, but I made things cumbersome for Python so it needs more time to deal with the tree than to deal with brute force.
c) everything is OK - it is just how things should behave for a large n
My query function looks like this:
from collections import deque
def query(self,low,high):
result = []
q = deque([self.root]) # this is the root node in the tree
append_result = result.append
append_q = q.append
pop_left = q.popleft
while q:
node = pop_left() # look at the next node
if node.overlap(low,high): # some overlap?
append_result(node.interval)
if node.low != None and low <= node.get_low_max(): # en-q left node
append_q(node.low)
if node.high != None and node.get_high_min() <= high: # en-q right node
append_q(node.high)
I build the tree like this:
def build(self, intervals):
"""
Function which is recursively called to build the tree.
"""
if intervals is None:
return None
if len(intervals) > 2: # intervals is always sorted in increasing order
mid = len(intervals)//2
# split intervals into three parts:
# central element (median)
center = intervals[mid]
# left half (<= median)
new_low = intervals[:mid]
#right half (>= median)
new_high = intervals[mid+1:]
#compute max on the lower side (left):
max_low = max([n.get_high() for n in new_low])
#store min on the higher side (right):
min_high = new_high[0].get_low()
elif len(intervals) == 2:
center = intervals[1]
new_low = [intervals[0]]
new_high = None
max_low = intervals[0].get_high()
min_high = None
elif len(intervals) == 1:
center = intervals[0]
new_low = None
new_high = None
max_low = None
min_high = None
else:
raise Exception('The tree is not behaving as it should...')
return(Node(center, self.build(new_low),self.build(new_high),
max_low, min_high))
EDIT:
A node is represented like this:
class Node:
def __init__(self, interval, low, high, max_low, min_high):
self.interval = interval # pointer to corresponding interval object
self.low = low # pointer to node containing intervals to the left
self.high = high # pointer to node containing intervals to the right
self.max_low = max_low # maxiumum value on the left side
self.min_high = min_high # minimum value on the right side
All the nodes in a subtree can be obtained like this:
def subtree(current):
node_list = []
if current.low != None:
node_list += subtree(current.low)
node_list += [current]
if current.high != None:
node_list += subtree(current.high)
return node_list
p.s. note that by exploiting that there is so much overlap and that all intervals have comparable lenghts, I managed to implement a simple method based on sorting and bisection that completed in 80 s, but I would say this is over-fitting... Amusingly, by using asymptotic analysis, I found it should have app. the same runtime as using the tree...

If I correctly understand your problem, you are trying to speed up your process.
If it is that, try to create a real tree instead of manipulating lists.
Something that looks like :
class IntervalTreeNode():
def __init__(self, parent, min, max):
self.value = (min,max)
self.parent = parent
self.leftBranch = None
self.rightBranch= None
def insert(self, interval):
...
def asList(self):
""" return the list that is this node and all the subtree nodes """
left=[]
if (self.leftBranch != None):
left = self.leftBranch.asList()
right=[]
if (self.rightBranch != None):
left = self.rightBranch.asList()
return [self.value] + left + right
And then at start create an internalTreeNode and insert all yours intervals in.
This way, if you really need a list you can build a list each time you need a result and not each time you make a step in your recursive iteration using [x:] or [:x] as list manipulation is a costly operation in python. It is possible to work also using directly the nodes instead of a list that will greatly speed up the process as you just have to return a reference to the node instead of doing some list addition.

Related

Recurrence relation and time complexity of finding next larger in Generic tree

Question: Given a generic tree and an integer n. Find and return the node with next larger element in the tree i.e. find a node with value just greater than n.
Although i was able to solve it is O(n) by removing the later for loop and doing comparisons while calling recursion. I am bit curious about time complexity of following version of code.
I came up with recurrence relation as T(n) = T(n-1) + (n-1) = O(n^2). Where T(n-1) is for time taken by children and + (n-1) for finding the next larger (second for loop). Have i done it right? or am i missing something?
def nextLargestHelper(root, n):
"""
root => reference to root node
n => integer
Returns node and value of node which is just larger not first larger than n.
"""
# Special case
if root is None:
return None, None
# list to store all values > n
largers = list()
# Induction step
if root.data > n:
largers.append([root, root.data])
# Induction step and Base case; if no children do not call recursion
for child in root.children:
# Induction hypothesis; my function returns me node and value just larger than 'n'
node, value = nextLargestHelper(child, n)
# If larger found in subtree
if node:
largers.append([node, value])
# Initialize node to none, and value as +infinity
node = None
value = sys.maxsize
# travers through all larger values and return the smallest value greater than n
for item in largers: # structure if item is [Node, value]
# this is why value is initialized to +infinity; so as it is true for first time
if item[1] < value:
node = item[0]
value = item[1]
return node, value

At first: please use different chacters for O-Notation and inputvalues.
You "touch" every node exactly once, so the result should be O(n). A bit special is your algorithm finding the minimum afterwards. You could include this in your go-though-all-children loop for an easier recurrence estimation. As it is, you have do a recurrence estimation for the minimum of the list as well.
Your recurrence equation should look more like T(n) = a*T(n/a) + c = O(n) since in each step you have a children forming a subtrees with size (n-1)/a. In each step you have next to some constant factors also the computation of the minimum of a list with at most a elements. You could write it as a*T(n/a) + a*c1 +c2 which is the same as a*T(n/a) + c. The actual formula would look more like this: T(n) = a*T((n-1)/a) + c but the n-1 makes it harder to apply the master theorem.

Raising performance of BFS in Python

How can I increase speed performance of below Python code?
My code works okay which means no errors but the performance of this code is very slow.
The input data is Facebook Large Page-Page Network dataset, you can access here the dataset: (http://snap.stanford.edu/data/facebook-large-page-page-network.html)
Problem definition:
Check if the distance between two nodes are less than max_distance
My constraints:
I have to import a .txt file of which format is like sample_input
Expected ouput is like
sample_output
Totall code runtime should be less than 5 secs.
Can anyone give me an advice to improve my code much better? Follow my code:
from collections import deque
class Graph:
def __init__(self, filename):
self.filename = filename
self.graph = {}
with open(self.filename) as input_data:
for line in input_data:
key, val = line.strip().split(',')
self.graph[key] = self.graph.get(key, []) + [val]
def check_distance(self, x, y, max_distance):
dist = self.path(x, y, max_distance)
if dist:
return dist - 1 <= max_distance
else:
return False
def path(self, x, y, max_distance):
start, end = str(x), str(y)
queue = deque([start])
while queue:
path = queue.popleft()
node = path[-1]
if node == end:
return len(path)
elif len(path) > max_distance:
return False
else:
for adjacent in self.graph.get(node, []):
queue.append(list(path) + [adjacent])
Thank you for your help in advance.

Several pointers:
if you call check distance more than once you have to recreate the graph
calling queue.pop(0) is inefficient on a standard list in python, use something like a deque from the collections module. see here
as DarrylG points out you can exit from the BFS early once a path has exceed the max distance
you could try
from collections import deque
class Graph:
def __init__(self, filename):
self.filename = filename
self.graph = self.file_to_graph()
def file_to_graph(self):
graph = {}
with open(self.filename) as input_data:
for line in input_data:
key, val = line.strip().split(',')
graph[key] = graph.get(key, []) + [val]
return graph
def check_distance(self, x, y, max_distance):
path_length = self.path(x, y, max_distance)
if path_length:
return len(path) - 1 <= max_distance
else:
return False
def path(self, x, y, max_distance):
start, end = str(x), str(y)
queue = deque([start])
while queue:
path = queue.popleft()
node = path[-1]
if node == end:
return len(path)
elif len(path) > max_distance:
# we have explored all paths shorter than the max distance
return False
else:
for adjacent in self.graph.get(node, []):
queue.append(list(path) + [adjacent])
As to why pop(0) is inefficient - from the docs:
Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.

About the approach:
You create a graph and execute several times comparison from one to another element in your graph. Every time you run your BFS algorithm. This will create a cost of O|E+V| at every time or, you need to calculate every time the distances again and again. This is not a good approach.
What I recommend. Run a Dijkstra Algorithm (that get the minimum distance between 2 nodes and store the info on an adjacency matrix. What you will need to do is only get the calculated info inside this adjacency matrix that will contains all minimal distances on your graph and what you will need is consume the distances calculated on a previous step
About the algorithms
I recommend you to look for different approaches of DFS/BFS.
If you're looking to compare all nodes I believe that Dijkstra's Algorithm will be more efficient in your case because they mark visited paths.(https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). You can modify and call the algo only once.
Other thing that you need to check is. Your graph contains cycles? If yes, you need to apply some control on cycles, you'll need to check about Ford Fulkerson Algorithm (https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm)
As I understood. Every time that you want to compare a node to another, you run again your algorithm. If you have 1000 elements on your graph, your comparison, at every time will visit 999 nodes to check this.
If you implement a Dijkstra and store the distances, you run only once for your entire network and save the distances in memory.
The next step is collect from memory the distances that you can put in an array.
You can save all distances on an adjacency matrix (http://opendatastructures.org/versions/edition-0.1e/ods-java/12_1_AdjacencyMatrix_Repres.html) and only consume the information several times without the calculation debt at every time.

Calculate height of an arbitrary (non-binary) tree

I'm currently taking on online data structures course and this is one of the homework assignments; please guide me towards the answer rather than giving me the answer.
The prompt is as follows:
Task. You are given a description of a rooted tree. Your task is to compute and output its height. Recall that the height of a (rooted) tree is the maximum depth of a node, or the maximum distance from a leaf to the root. You are given an arbitrary tree, not necessarily a binary tree.
Input Format. The first line contains the number of nodes n. The second line contains integer numbers from −1 to n−1 parents of nodes. If the i-th one of them (0 ≤ i ≤ n−1) is −1, node i is the root, otherwise it’s 0-based index of the parent of i-th node. It is guaranteed that there is exactly one root. It is guaranteed that the input represents a tree.
Constraints. 1 ≤ n ≤ 105.
My current solution works, but is very slow when n > 102. Here is my code:
# python3
import sys
import threading
# In Python, the default limit on recursion depth is rather low,
# so raise it here for this problem. Note that to take advantage
# of bigger stack, we have to launch the computation in a new thread.
sys.setrecursionlimit(10**7) # max depth of recursion
threading.stack_size(2**27) # new thread will get stack of such size
threading.Thread(target=main).start()
# returns all indices of item in seq
def listOfDupes(seq, item):
start = -1
locs = []
while True:
try:
loc = seq.index(item, start+1)
except:
break
else:
locs.append(loc)
start = loc
return locs
def compute_height(node, parents):
if node not in parents:
return 1
else:
return 1 + max(compute_height(i, parents) for i in listOfDupes(parents, node))
def main():
n = int(input())
parents = list(map(int, input().split()))
print(compute_height(parents.index(-1), parents))
Example input:
>>> 5
>>> 4 -1 4 1 1
This will yield a solution of 3, because the root is 1, 3 and 4 branch off of 1, then 0 and 2 branch off of 4 which gives this tree a height of 3.
How can I improve this code to get it under the time benchmark of 3 seconds? Also, would this have been easier in another language?

Python will be fine as long as you get the algorithm right. Since you're only looking for guidance, consider:
1) We know the depth of a node iif the depth of its parent is known; and
2) We're not interested in the tree's structure, so we can throw irrelevant information away.
The root node pointer has the value -1. Suppose that we replaced its children's pointers to the root node with the value -2, their children's pointers with -3, and so forth. The greatest absolute value of these is the height of the tree.
If we traverse the tree from an arbitrary node N(0) we can stop as soon as we encounter a negative value at node N(k), at which point we can replace each node with the value of its parent, less one. I.e, N(k-1) = N(k) -1, N(k-2)=N(k-1) - 1... N(0) = N(1) -1. As more and more pointers are replaced by their depth, each traversal is more likely to terminate by encountering a node whose depth is already known. In fact, this algorithm takes basically linear time.
So: load your data into an array, start with the first element and traverse the pointers until you encounter a negative value. Build another array of the nodes traversed as you go. When you encounter a negative value, use the second array to replace the original values in the first array with their depth. Do the same with the second element and so forth. Keep track of the greatest depth you've encountered: that's your answer.

The structure of this question looks like it would be better solved bottom up rather than top down. Your top-down approach spends time seeking, which is unnecessary, e.g.:
def height(tree):
for n in tree:
l = 1
while n != -1:
l += 1
n = tree[n]
yield l
In []:
tree = '4 -1 4 1 1'
max(height(list(map(int, tree.split()))))
Out[]:
3
Or if you don't like a generator:
def height(tree):
d = [1]*len(tree)
for i, n in enumerate(tree):
while n != -1:
d[i] += 1
n = tree[n]
return max(d)
In []:
tree = '4 -1 4 1 1'
height(list(map(int, tree.split())))
Out[]:
3
The above is brute force as it doesn't take advantage of reusing parts of the tree you've already visited, it shouldn't be too hard to add that.

Your algorithm spends a lot of time searching the input for the locations of numbers. If you just iterate over the input once, you can record the locations of each number as you come across them, so you don't have to keep searching over and over later. Consider what data structure would be effective for recording this information.

Simple Genetic Algorithm meeting local optimum for "Hello World"

My target was simple, using genetic algorithm to reproduce the classical "Hello, World" string.
My code was based on this post. The code mainly contain 4 parts:
Generate the population which has serval different individual
Define the fitness and grade function which evaluate the individual good or bad based on the comparing with target.
Filter the population and leave len(pop)*retain individuals
Add some other individuals and mutate randomly
The parents's DNA will pass over to its children to comprise the whole population.
I modified the code and shows like this:
import numpy as np
import string
from operator import add
from random import random, randint
def population(GENSIZE,target):
p = []
for i in range(0,GENSIZE):
individual = [np.random.choice(list(string.printable[:-5])) for j in range(0,len(target))]
p.append(individual)
return p
def fitness(source, target):
fitval = 0
for i in range(0,len(source)-1):
fitval += (ord(target[i]) - ord(source[i])) ** 2
return (fitval)
def grade(pop, target):
'Find average fitness for a population.'
summed = reduce(add, (fitness(x, target) for x in pop))
return summed / (len(pop) * 1.0)
def evolve(pop, target, retain=0.2, random_select=0.05, mutate=0.01):
graded = [ (fitness(x, target), x) for x in p]
graded = [ x[1] for x in sorted(graded)]
retain_length = int(len(graded)*retain)
parents = graded[:retain_length]
# randomly add other individuals to
# promote genetic diversity
for individual in graded[retain_length:]:
if random_select > random():
parents.append(individual)
# mutate some individuals
for individual in parents:
if mutate > random():
pos_to_mutate = randint(0, len(individual)-1)
individual[pos_to_mutate] = chr(ord(individual[pos_to_mutate]) + np.random.randint(-1,1))
#
parents_length = len(parents)
desired_length = len(pop) - parents_length
children = []
while len(children) < desired_length:
male = randint(0, parents_length-1)
female = randint(0, parents_length-1)
if male != female:
male = parents[male]
female = parents[female]
half = len(male) / 2
child = male[:half] + female[half:]
children.append(child)
parents.extend(children)
return parents
GENSIZE = 40
target = "Hello, World"
p = population(GENSIZE,target)
fitness_history = [grade(p, target),]
for i in xrange(20):
p = evolve(p, target)
fitness_history.append(grade(p, target))
# print p
for datum in fitness_history:
print datum
But it seems that the result can't fit targetwell.
I tried to change the GENESIZE and loop time(more generation).
But the result always get stuck. Sometimes, enhance the loop time can help to find a optimum solution. But when I change the loop time to an much larger number like for i in xrange(10000). The result shows the error like:
individual[pos_to_mutate] = chr(ord(individual[pos_to_mutate]) + np.random.randint(-1,1))
ValueError: chr() arg not in range(256)
Anyway, how to modify my code and get an good result.
Any advice would be appreciate.

The chr function in Python2 only accepts values in the range 0 <= i < 256.
You are passing:
ord(individual[pos_to_mutate]) + np.random.randint(-1,1)
So you need to check that the result of
ord(individual[pos_to_mutate]) + np.random.randint(-1,1)
is not going to be outside that range, and take corrective action before passing to chr if it is outside that range.
EDIT
A reasonable fix for the ValueError might be to take the amended value modulo 256 before passing to chr:
chr((ord(individual[pos_to_mutate]) + np.random.randint(-1, 1)) % 256)
There is another bug: the fitness calculation doesn't take the final element of the candidate list into account: it should be:
def fitness(source, target):
fitval = 0
for i in range(0,len(source)): # <- len(source), not len(source) -1
fitval += (ord(target[i]) - ord(source[i])) ** 2
return (fitval)
Given that source and target must be of equal length, the function can be written as:
def fitness(source, target):
return sum((ord(t) - ord(s)) ** 2 for (t, s) in zip(target, source))
The real question was, why doesn't the code provided evolve random strings until the target string is reached.
The answer, I believe, is it may, but will take a lot of iterations to do so.
Consider, in the blog post referenced in the question, each iteration generates a child which replaces the least fit member of the gene pool if the child is fitter. The selection of the child's parent is biased towards fitter parents, increasing the likelihood that the child will enter the gene pool and increase the overall "fitness" of the pool. Consequently the members of the gene pool converge on the desired result within a few thousand iterations.
In the code in the question, the probability of mutation is much lower, based on the initial conditions, that is the defaults for the evolve function.
Parents that are retained have only a 1% chance of mutating, and one third of the time the "mutation" will not result in a change (zero is a possible result of random.randint(-1, 1)).
Discard parents are replaced by individuals created by merging two retained individuals. Since only 20% of parents are retained, the population can converge on a local minimum where each new child is effectively a copy of an existing parent, and so no diversity is introduced.
So apart from fixing the two bugs, the way to converge more quickly on the target is to experiment with the initial conditions and to consider changing the code in the question to inject more diversity, for example by mutating children as in the original blog post, or by extending the range of possible mutations.

kd tree build sort optimize

The algorithm build of kd-tree implemented in the Python programming language is as follows (from http://en.wikipedia.org/wiki/K-d_tree):
class Node: pass
def kdtree(point_list, depth=0):
if not point_list:
return None
# Select axis based on depth so that axis cycles through all valid values
k = len(point_list[0]) # assumes all points have the same dimension
axis = depth % k
# Sort point list and choose median as pivot element
point_list.sort(key=lambda point: point[axis])
median = len(point_list) // 2 # choose median
# Create node and construct subtrees
node = Node()
node.location = point_list[median]
node.left_child = kdtree(point_list[:median], depth + 1)
node.right_child = kdtree(point_list[median + 1:], depth + 1)
return node
Sorting is performed on every step. How to reduce the amount of sorting?

It looks like you're only sorting to split around the median. Instead, you could implement a linear-time selection algorithm such as quickselect, then do a linear-time partition of point_list. Then, you don't need to sort at all anymore.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.