Python: Find running median with Max-Heap and Min-Heap

Python: Find running median with Max-Heap and Min-Heap - python

I'm trying to return the running median for a series of streaming numbers. To do that I use a max-heap (which stores the values on the lower half of the series) and a min-heap (which stores the values on the higher half of the series).
In particular I'm using the Python (2.0) built-in min-heap data structure from the heapq module (https://docs.python.org/2/library/heapq.html). To build the max-heap instead I simply use the negative of the numbers I need to push into my heap.
My Python code is the following:
import heapq
maxh = []
minh = []
vals=[1,2,3,4,5,6,7,8,9,10]
for val in vals:
# Initialize the data-structure and insert/push the 1st streaming value
if not maxh and not minh:
heapq.heappush(maxh,-val)
print float(val)
elif maxh:
# Insert/push the other streaming values
if val>-maxh[0]:
heapq.heappush(minh,val)
elif val<-maxh[0]:
heapq.heappush(maxh,-val)
# Calculate the median
if len(maxh)==len(minh):
print float(-maxh[0]+minh[0])/2
elif len(maxh)==len(minh)+1:
print float(-maxh[0])
elif len(minh)==len(maxh)+1:
print float(minh[0])
# If min-heap and max-heap grow unbalanced we rebalance them by
# removing/popping one element from a heap and inserting/pushing
# it into the other heap, then we calculate the median
elif len(minh)==len(maxh)+2:
heapq.heappush(maxh,-heapq.heappop(minh))
print float(-maxh[0]+minh[0])/2
elif len(maxh)==len(minh)+2:
heapq.heappush(minh,-heapq.heappop(maxh))
print float(-maxh[0]+minh[0])/2
Below is the full list of test cases I've built to check my code:
vals=[1,2,3,4,5,6,7,8,9,10] # positive numbers, increasing series
vals=[10,9,8,7,6,5,4,3,2,1] # positive numbers, decreasing series
vals=[10,9,11,8,12,7,13,6,14,5] # positive numbers, jumping series (keeping
# heaps balanced)
vals=[-10,-9,-8,-7,-6,-5,-4,-3,-2,-1] # negative numbers, increasing series
vals=[-1,-2,-3,-4,-5,-6,-7,-8,-9,-10] # negative numbers, decreasing series
vals=[-10,-9,-11,-8,-12,-7,-13,-6,-14,-5] # negative numbers
# jumping series (keeping heaps
# balanced)
vals=[-5,-4,-3,-2,-1,0,1,2,3,4,5] # mixed positive-negative numbers,
# increasing series
vals=[5,4,3,2,1,0,-1,-2,-3,-4,-5] # mixed positive-negative numbers,
# decreasing series
vals=[0,-1,1,-2,2,-3,3,-4,4,-5,5] # mixed positive-negative numbers,
# jumping series (keeping heaps balanced)
My code seems ok to me but I cannot pass 4 out of 10 test cases with an online judge (https://www.hackerrank.com/challenges/ctci-find-the-running-median/problem).
Do you have any hint?

The problem is here:
# Insert/push the other streaming values
if val>-maxh[0]:
heapq.heappush(minh,val)
elif val<-maxh[0]:
heapq.heappush(maxh,-val)
If val == maxh[0], then the item is never pushed onto either heap. You should be able to reveal the error with the test case [1,1,2].
A simple fix would be:
# Insert/push the other streaming values
if val >= -maxh[0]:
heapq.heappush(minh,val)
else
heapq.heappush(maxh,-val)

Related

Binary Search vs. Linear Search Strange Times

I have a list of about 100k integers and I am testing out concepts of linear vs binary search. It looks like my linear search is actually faster than my binary search. Im not sure I see why the that is though. Any insights?
It also looks like my cpu takes a major hit with the linear search.
def linear_search(big_list, element):
"""
Iterate over an array and return the index of the first occurance of the item.
Use to determine the time it take a standard seach to complate.
Compare to binary search to see the difference in speed.
% time ./searchalgo.py
Results: time ./main.py 0.35s user 0.09s system 31% cpu 1.395 total
"""
for i in range(len(big_list)):
if big_list[i] == element:
# Returning the element in the list.
return i
return -1
print(linear_search(big_list, 999990))
def linear_binary_search(big_list, value):
"""
Divide and concuer approach
- Sort the array.
- Is the value I am searching for equal to the middle element of the array.
- Compare is the middle element smaller orlarger than the element you are looking for?
- If smaller then perform linear search to the right.
- If larger then perform linear seach to the left.
% time ./searchalgo.py
Results: 0.57s user 0.18s system 32% cpu 2.344 total
"""
big_list = sorted(big_list)
left, right = 0, len(big_list) - 1
index = (left + right) // 2
while big_list[index] != value:
if big_list[index] == value:
return index
if big_list[index] < value:
index = index + 1
elif big_list[index] > value:
index = index - 1
print(big_list[index])
# linear_binary_search(big_list, 999990)
Output of the linear search time:
./main.py 0.26s user 0.08s system 94% cpu 0.355 total
Output of the binary search time:
./main.py 0.39s user 0.11s system 45% cpu 1.103 total

your first algorithm time complexity is O(n) because of single traversal. But in case of your second traversal, first you sort the elements which takes O(nlogn) time which and then for searching it takes O(logn). So your second algorithm time complexity will be O(nlogn) + O(logn) which is greater than time complexity of your first algorithm.

def binary_search(sorted_list,target):
if not sorted_list:
return None
mid = len(sorted_list)//2
if sorted_list[mid] == target:
return target
if sorted_list[mid] < target:
return binary_search(sorted_list[mid+1:],target)
return binary_search(sorted_list[:mid],target)
Im pretty sure will actually correctly implement binary search recursively (which is easier for my brain to process) ... there is also the builtin bisect library that basically implements binary_search for you

Improving some snipets

I have this functons to simulate mutations on DNA sequences( sequence of letter -> 'ACGTGCTTAGG', for exemple).
The first one just change a random position of the input sequence
def mutate(sequence):
seq_lst = list(sequence)
i = random.randint(0, len(seq_lst) - 1)
seq_lst[i] = random.choice(list('ATCG'))
return ''.join(seq_lst)
The second one is to simulate a insertion of a base inside a random position in the sequence.
def insertion(sequence):
seq_lst = list(sequence)
i = random.randint(0, len(seq_lst) - 1)
mutate = seq_lst[:i] + [random.choice(list('ATCG'))] + seq_lst[i:]
return ''.join(mutate)]
The last one is to select all kinds of possible random mutations that can occur in a sequence.
def mutations(sequence):
i = random.randint(0, 3)
print(i)
if i == 0:
print('SNV')
return mutate(sequence)
elif i == 1:
print('Del')
return sequence.replace(random.choice('ATCG'), '-')
elif i == 2:
print('Ins')
return insertion(sequence)
elif i == 3:
print('No mut')
return sequence
The print statements are just to check if the code is working accordling.
Any suggestion for improvement? If possible suggestions how to insert mutations probabilities in the code to simulate a more real situation.
What I saw in the return of 10000 random process is that the sequence accumulates a lot of deletions, what is wrong once single point mutations are more frequente, followed by insertions and deletions with less frequency.
Thanks

There will be excessive numbers of deletions in the sequence simulation because of this,
import random
sequence = 'ACTCAG'
sequence.replace(random.choice('ATCG'), '-')
OUT
A-T-AG
Around 1/4 of the time two deletions will occur simultaneously for a given event. Thus the probabilities are not uniform, resulting in a higher chance of deletions than insertions (or mutations). Thus a 1/4 chance of a insert or deletion and an additional 1/4 chance of a double deletion vs. a 0 change for an insertion.
There are two other bias,
You will generate reversion mutations A->A, so 1/4 mutations will not appear to mutate. This naturally occurs in DNA mutations but it is worth keeping mind.
Finally, once a deletion occurs there becomes increased mutation on the remaining nucleotides and essentially you will bias the system towards fewer and fewer nucleotides, so those that remain will undergo increased mutation.
In otherwords, the probabilities are not uniform and will be dynamic during the simulation.
You could instead used re.sub via the function below to ensure the probability between insert and deletion remain uniform,
import random, re
def rand_replacement(string, to_be_replaced, items):
return re.sub(to_be_replaced, lambda x: random.choice(items), string )

Justification of constants used in random.sample

I'm looking into the source code for the function sample in random.py (python standard library).
The idea is simple:
If a small sample (k) is needed from a large population (n): Just pick k random indices, since it is unlikely you'll pick the same number twice as the population is so large. And if you do, just pick again.
If a relatively large sample (k) is needed, compared to the total population (n): It is better to keep track of what you have picked.
My Question
There are a few constants involved, setsize = 21 and setsize += 4 ** _log(3*k,4). The critical ratio is roughly k : 21+3k. The comment says # size of a small set minus size of an empty list and # table size for big sets.
Where have these specific numbers come from? What is there justification?
The comments shed some light, however I find they bring as many questions as they answer.
I would kind of understand, size of a small set but find the "minus size of an empty list" confusing. Can someone shed any light on this?
what is meant specifically by "table" size, as apposed to say "set size".
Looking on the github repository, it looks like a very old version simply used the ratio k : 6*k, as the critical ratio, but I find that equally mysterious.
The code
def sample(self, population, k):
"""Chooses k unique random elements from a population sequence or set.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
To choose a sample in a range of integers, use range as an argument.
This is especially fast and space efficient for sampling from a
large population: sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
if isinstance(population, _Set):
population = tuple(population)
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
randbelow = self._randbelow
n = len(population)
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
result = [None] * k
setsize = 21 # size of a small set minus size of an empty list
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
if n <= setsize:
# An n-length list is smaller than a k-length set
pool = list(population)
for i in range(k): # invariant: non-selected at [0,n-i)
j = randbelow(n-i)
result[i] = pool[j]
pool[j] = pool[n-i-1] # move non-selected item into vacancy
else:
selected = set()
selected_add = selected.add
for i in range(k):
j = randbelow(n)
while j in selected:
j = randbelow(n)
selected_add(j)
result[i] = population[j]
return result
(I apologise is this question would be better placed in math.stackexchange. I couldn't think of any probability/statistics-y reasons for this particular ratio, and the comments sounded as though, it was maybe something to do with the amount of space that sets and lists use - but could't find any details anywhere).

This code is attempting to determine whether using a list or a set would take more space (instead of trying to estimate the time cost, for some reason).
It looks like 21 was the difference between the size of an empty list and a small set on the Python build this constant was determined on, expressed in multiples of the size of a pointer. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:
>>> sys.getsizeof(set()) - sys.getsizeof([])
160
and comparing the 3.6.3 list and set struct definitions to the list and set definitions from the change that introduced this code, 21 seems plausible.
I said "the difference between the size of an empty list and a small set" because both now and at the time, small sets used a hash table contained inside the set struct itself instead of externally allocated:
setentry smalltable[PySet_MINSIZE];
The
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
check adds the size of the external table allocated for sets larger than 5 elements, with size again expressed in number of pointers. This computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
Finally,
if n <= setsize:
compares the base overhead of a set plus any space used by an external hash table to the n pointers required by a list of the input elements. (It doesn't seem to account for the overallocation performed by list(population), so it may be underestimating the cost of the list.)

Fast algorithm of multiple numbers sequential decrements until limit of sum

I have:
an ordered list of dc objects that have a float field result in it.
limit value for sum of result.
pack (not a better name ever) is a value of decreasing.
Problem:
I need to sequentially decrease results for each dc until sum of all results will be less or equal limit (without assigning result values below 0).
After some profiling I got this code:
while(self.sum > self.limit):
for dc in self.dc:
if dc.result > 0:
# max() too slow here
result = (
dc.result - self.pack
if dc.result - self.pack > 0
else 0
)
# Prevent sum() count for all list on each iteration
self.sum -= dc.result - result
dc.result = result
if self.sum <= self.limit:
break
But it has a low performance for small self.pack values (the code is doing too many iterations).
Is there a way to make this method faster?

If you are not too concerned how much you remove from the pack as long as it ensures it is less than sum, then you could just implement DC as a max heap (priority queue), and pop it every time until sum is <= self.limit. That would significantly speed up processing time especially in big data sets.
Edit:
Since dc is an ordered list, just treat it like a stack and pop from the back and remove from the pack (since the "heaviest" things are at the back).

Weighted random selection with and without replacement

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the resevoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the resevoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.

One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.
Let's us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1)
To create the alias lookup:
Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each weight.
Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
Take the variable with the least remaining weight, and place as much of it's mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)
If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
Repeat steps 3 and 4, until none of the weight from the original partition need be assigned to the list.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
Get a U(0,1) random number, say binary 0.001100000
bitshift it lg2(p), finding the index partition. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.

A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.

Here's what I came up with for weighted selection without replacement:
def WeightedSelectionWithoutReplacement(l, n):
"""Selects without replacement n random elements from a list of (weight, item) tuples."""
l = sorted((random.random() * x[0], x[1]) for x in l)
return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
def WeightedSelectionWithReplacement(l, n):
"""Selects with replacement n random elements from a list of (weight, item) tuples."""
cuml = []
total_weight = 0.0
for weight, item in l:
total_weight += weight
cuml.append((total_weight, item))
return [cuml[bisect.bisect(cuml, random.random()*total_weight)] for x in range(n)]
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.

I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.

It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we will walk through it, and for any underpopulated bin which would would receive excess hits, assign the excess to an overpopulated bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that they have already been processed.
Here is a minimal python implementation, based on the C implementation here
def prep(weights):
data_sz = len(weights)
factor = data_sz/float(sum(weights))
data = [[w*factor, i] for i,w in enumerate(weights)]
big=0
while big<data_sz and data[big][0]<=1.0: big+=1
for small,bucket in enumerate(data):
if bucket[1] is not small: continue
excess = 1.0 - bucket[0]
while excess > 0:
if big==data_sz: break
bucket[1] = big
bucket = data[big]
bucket[0] -= excess
excess = 1.0 - bucket[0]
if (excess >= 0):
big+=1
while big<data_sz and data[big][0]<=1: big+=1
return data
def sample(data):
r=random.random()*len(data)
idx = int(r)
return data[idx][1] if r-idx > data[idx][0] else idx
Example usage:
TRIALS=1000
weights = [20,1.5,9.8,10,15,10,15.5,10,8,.2];
samples = [0]*len(weights)
data = prep(weights)
for _ in range(int(sum(weights)*TRIALS)):
samples[sample(data)]+=1
result = [float(s)/TRIALS for s in samples]
err = [a-b for a,b in zip(result,weights)]
print(result)
print([round(e,5) for e in err])
print(sum([e*e for e in err]))

The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element)
the un-normalized weight of the element (elementweight), and
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight).
the sum of all the un-normalized weights of the right-child node and all of
its chilren (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. Then the values of leftbranchweight, rightbranchweight,
and elementweight of node is summed, and the weights are divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
remove the element from the BST as normal, updating leftbranchweight
and rightbranchweight of all the necessary nodes, and return the
element.
else if the number is less than (elementprobability + leftbranchweight)
recurse on leftchild (run the algorithm using leftchild as node)
else
recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).

This is an old question for which numpy now offers an easy solution so I thought I would mention it. Current version of numpy is version 1.2 and numpy.random.choice allows the sampling to be done with or without replacement and with given weights.

Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with a prob. distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you have a sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice

We faced a problem to randomly select K validators of N candidates once per epoch proportionally to their stakes. But this gives us the following problem:
Imagine probabilities of each candidate:
0.1
0.1
0.8
Probabilities of each candidate after 1'000'000 selections 2 of 3 without replacement became:
0.254315
0.256755
0.488930
You should know, those original probabilities are not achievable for 2 of 3 selection without replacement.
But we wish initial probabilities to be a profit distribution probabilities. Else it makes small candidate pools more profitable. So we realized that random selection with replacement would help us – to randomly select >K of N and store also weight of each validator for reward distribution:
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
int value = rand() % likehoodsSum;
for (int i = 0; i < n; i++) {
if (value < likehoods[i]) {
if (weights[i] == 0) {
validators.push_back(i);
}
weights[i]++;
totalWeights++;
break;
}
value -= likehoods[i];
}
}
It gives an almost original distribution of rewards on millions of samples:
0.101230
0.099113
0.799657

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.