The following is the code I've written to implement Flajolet and Martin's algorithm. I've used the Jenkins hash function to generate a 32-bit hash value of the data. The program seems to follow the algorithm but is off the mark by about 20%: my data set consists of more than 200,000 unique records, whereas the program reports about 160,000. Please help me understand the mistake(s) I am making. The hash function is implemented as per Bob Jenkins' website.
import numpy as np
from jenkinshash import jhash
class PCSA():
def __init__(self, nmap, maxlength):
self.nmap = nmap
self.maxlength = maxlength
self.bitmap = np.zeros((nmap, maxlength), dtype=np.int)
def count(self, data):
hashedValue = jhash(data)
indexAlpha = hashedValue % self.nmap
ix = hashedValue / self.nmap
ix = bin(ix)[2:][::-1]
indexBeta = ix.find("1") #find index of lsb
if self.bitmap[indexAlpha, indexBeta] == 0:
self.bitmap[indexAlpha, indexBeta] = 1
def getCardinality(self):
sumIx = 0
for row in range(self.nmap):
sumIx += np.where(self.bitmap[row, :] == 0)[0][0]
A = sumIx / self.nmap
cardinality = self.nmap * (2 ** A)/ MAGIC_CONST
return cardinality
If you are running this in Python 2, then the division used to calculate A is integer (floor) division, so A gets truncated to an integer. Truncating A can only shrink 2 ** A, which is consistent with the undercount you are seeing.
If this is the case, you could try changing:
A = sumIx / self.nmap
to
A = float(sumIx) / self.nmap
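Alternatively, you can make / behave as true division throughout the module, which is a standard Python 2 option; a minimal sketch:

from __future__ import division  # put this at the top of the module

# with the import above, / is true division in Python 2 as well:
print(7 / 2)  # 3.5, not 3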
I am trying to solve LeetCode problem 295. Find Median from Data Stream:
The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value and the median is the mean of the two middle values.
For example, for arr = [2,3,4], the median is 3.
For example, for arr = [2,3], the median is (2 + 3) / 2 = 2.5.
Implement the MedianFinder class:
MedianFinder() initializes the MedianFinder object.
void addNum(int num) adds the integer num from the data stream to the data structure.
double findMedian() returns the median of all elements so far.
Answers within 10^-5 of the actual answer will be accepted.
Example 1:
[...]
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1); // arr = [1]
medianFinder.addNum(2); // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3); // arr = [1, 2, 3]
medianFinder.findMedian(); // return 2.0
For this question, I am adding numbers to a heap and then later using the smallest one for some operations.
When I tried to do the operation, I found out that my heap returns a different value when doing heapq.heappop(self.small) than when doing self.small[0].
Could you please explain this to me? Any hint is much appreciated.
(Every number in self.small is added using heapq.heappush)
Here is my code, which works:
import heapq

class MedianFinder:
def __init__(self):
self.small, self.large = [], []
def addNum(self, num):
heapq.heappush(self.small, -1 * num)
if (self.small and self.large) and -1 * self.small[0] > self.large[0]:
val = -1 * heapq.heappop(self.small)
heapq.heappush(self.large, val)
if len(self.small) > len(self.large) + 1:
val = -1 * heapq.heappop(self.small)
heapq.heappush(self.large, val)
if len(self.large) > len(self.small) + 1:
val = -1 * heapq.heappop(self.large)
heapq.heappush(self.small, val)
def findMedian(self):
if len(self.small) > len(self.large):
return -1 * self.small[0]
elif len(self.small) < len(self.large):
return self.large[0]
else:
return (-1 * self.small[0] + self.large[0]) / 2
For the last line, if I change:
-1 * self.small[0] + self.large[0]
into:
-1 * heapq.heappop(self.small) + heapq.heappop(self.large)
then the tests fail.
Why would that be any different?
When you change -1 * self.small[0] + self.large[0] into -1 * heapq.heappop(self.small) + heapq.heappop(self.large) then it will still work the first time findMedian is called, but when it is called again, it will return (in general) a different result. The reason is that with heappop, you remove a value from the heap. This should not happen, as this changes the data that a next call of findMedian will have to deal with. findMedian is supposed to leave the data structure unchanged.
Note how the challenge says this:
double findMedian() returns the median of all elements so far.
I highlight the last two words. These indicate that findMedian is not (only) called when the whole stream of data has been processed, but will be called several times during the processing of the data stream. That makes it crucial that findMedian does not modify the data structure, and so heappop should not be used.
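To see the difference in isolation, here is a small sketch (not from the original post):

import heapq

h = []
for x in (5, 1, 3):
    heapq.heappush(h, x)

print(h[0])               # 1 -- peeks at the smallest item; the heap is unchanged
print(heapq.heappop(h))   # 1 -- also returns the smallest item, but removes it
print(h[0])               # 3 -- the heap has permanently lost an element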
Is there a faster way to implement this? Each row is about 1024 buckets, and it's not as fast as I wish it was.
I'd like to generate quite a lot of data, but it needs a few hours to complete as it is; it's quite the bottleneck at this point. Any suggestions or ideas for how to optimize it would be greatly appreciated!
Edit:
Apologies for not having a minimal working example before; it is now posted. If the optimization could be made for Python 2.7, it would be much appreciated.
import math
import numpy as np
import copy
import random
def number_to_move(n):
l=math.exp(-n)
k=0
p=1.0
while p>l:
k += 1
p *= random.random()
return k-n
def createShuffledDataset(input_data, shuffle_indexes_dict, shuffle_quantity):
shuffled = []
for key in shuffle_indexes_dict:
for values in shuffle_indexes_dict[key]:
temp_holder = copy.copy(input_data[values[0] - 40: values[1]]) #may need to increase 100 padding
for line in temp_holder:
buckets = range(1,1022)
for bucket in buckets:
bucket_value = line[bucket]
proposed_number = number_to_move(bucket_value)
moving_amount = abs(proposed_number) if bucket_value - abs(proposed_number) >= 0 else bucket_value
line[bucket] -= moving_amount
if proposed_number > 0:
line[bucket + 1] += moving_amount
else:
line[bucket - 1] += moving_amount
shuffled.extend(temp_holder)
return np.array(shuffled)
example_data = np.ones((100,1024))
shuffle_indexes = {"Ranges to Shuffle 1" : [[10,50], [53, 72]]}
shuffle_quantity = 1150
shuffled_data = createShuffledDataset(example_data, shuffle_indexes,
shuffle_quantity)
Some minor things you might try:
merge key and values into a single call:
for key, values in shuffle_indexes_dict.iteritems():
use xrange rather than range (this and the previous suggestion are combined in the sketch at the end of this answer)
bucket values seem to be integers, so try caching them:
_cache = {}
def number_to_move(n):
v = _cache.get(n, None)
if v is not None: return v
l=math.exp(-n)
k=0
p=1.0
while p>l:
k += 1
p *= random.random()
v = k-n
_cache[n] = v
return v
If the shuffle_indexes_dict ranges are mutually exclusive (non-overlapping), then you can probably avoid copying values from input_data.
Otherwise I'd say you're out of luck.
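Putting the first two suggestions together, the rewritten function might look like this (an untested Python 2.7 sketch; number_to_move, copy and np are the same as in the question):

def createShuffledDataset(input_data, shuffle_indexes_dict, shuffle_quantity):
    shuffled = []
    for key, ranges in shuffle_indexes_dict.iteritems():  # one dictionary lookup per key
        for values in ranges:
            temp_holder = copy.copy(input_data[values[0] - 40: values[1]])
            for line in temp_holder:
                for bucket in xrange(1, 1022):  # xrange avoids building a 1021-item list per line
                    bucket_value = line[bucket]
                    proposed_number = number_to_move(bucket_value)
                    moving_amount = abs(proposed_number) if bucket_value - abs(proposed_number) >= 0 else bucket_value
                    line[bucket] -= moving_amount
                    if proposed_number > 0:
                        line[bucket + 1] += moving_amount
                    else:
                        line[bucket - 1] += moving_amount
            shuffled.extend(temp_holder)
    return np.array(shuffled)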
I was able to vectorize this portion and get rid of a nested loop.
def EMalgofast(obsdata, beta, pjt):
n = np.shape(obsdata)[0]
g = np.shape(pjt)[0]
zijtpo = np.zeros(shape=(n,g))
for j in range(g):
zijtpo[:,j] = pjt[j]*stats.expon.pdf(obsdata,scale=beta[j])
zijdenom = np.sum(zijtpo, axis=1)
zijtpo = zijtpo/np.reshape(zijdenom, (n,1))
pjtpo = np.mean(zijtpo, axis=0)
I wasn't able to vectorize the portion below; I need to figure that out.
betajtpo_1 = []
for j in range(g):
num = 0
denom = 0
for i in range(n):
num = num + zijtpo[i][j]*obsdata[i]
denom = denom + zijtpo[i][j]
betajtpo_1.append(num/denom)
betajtpo = np.asarray(betajtpo_1)
return(pjtpo,betajtpo)
I'm guessing Python is not your first programming language, based on what I see. The reason I'm saying this is that in Python we normally don't have to deal with manipulating indexes; you act directly on the value or the key returned. Please don't take this as an offense, I do the same coming from C++ myself. It's a hard habit to break ;).
If you're interested in performance, there is a good presentation by Raymond Hettinger on writing Python that is both optimized and beautiful:
https://www.youtube.com/watch?v=OSGv2VnC0go
As for the code you need help with, does this help? It's unfortunately untested, as I need to leave...
ref:
Iterating over a numpy array
http://docs.scipy.org/doc/numpy/reference/generated/numpy.true_divide.html
def EMalgofast(obsdata, beta, pjt):
n = np.shape(obsdata)[0]
g = np.shape(pjt)[0]
zijtpo = np.zeros(shape=(n,g))
for j in range(g):
zijtpo[:,j] = pjt[j]*stats.expon.pdf(obsdata,scale=beta[j])
zijdenom = np.sum(zijtpo, axis=1)
zijtpo = zijtpo/np.reshape(zijdenom, (n,1))
pjtpo = np.mean(zijtpo, axis=0)
betajtpo_1 = []
#manipulating an array of numerator and denominator instead of creating objects each iteration
num = np.zeros(g)
denom = np.zeros(g)
#generating the num and denom real value for the end result
for (i, j), value in np.ndenumerate(zijtpo):
    num[j] += value * obsdata[i]
    denom[j] += value
#dividing all at once after instead of inside the loop
betajtpo_1 = np.true_divide(num, denom)
betajtpo = np.asarray(betajtpo_1)
return(pjtpo,betajtpo)
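For what it's worth, the same numerator/denominator accumulation can also be written without an explicit Python loop. This is an untested sketch that assumes zijtpo has shape (n, g) and obsdata has shape (n,); the helper name is only for illustration:

import numpy as np

def m_step_beta(zijtpo, obsdata):
    # num[j] = sum over i of zijtpo[i, j] * obsdata[i]; denom[j] = sum over i of zijtpo[i, j]
    num = zijtpo.T.dot(obsdata)   # shape (g,)
    denom = zijtpo.sum(axis=0)    # shape (g,)
    return num / denom            # element-wise division, shape (g,)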
Please leave me some feedback!
Regards,
Eric Lafontaine
I've written some Python code to calculate a certain quantity from a cosmological simulation. It does this by checking whether a particle is contained within a box of size 8,000^3, starting at the origin and advancing the box once all particles contained within it have been found. As I am counting ~2 million particles altogether, and the total size of the simulation volume is 150,000^3, this is taking a long time.
I'll post my code below, does anybody have any suggestions on how to improve it?
Thanks in advance.
from __future__ import division
import numpy as np
def check_range(pos, i, j, k):
a = 0
if i <= pos[2] < i+8000:
if j <= pos[3] < j+8000:
if k <= pos[4] < k+8000:
a = 1
return a
def sigma8(data):
N = []
to_do = data
print 'Counting number of particles per cell...'
for k in range(0,150001,8000):
for j in range(0,150001,8000):
for i in range(0,150001,8000):
temp = []
n = []
for count in range(len(to_do)):
n.append(check_range(to_do[count],i,j,k))
to_do[count][1] = n[count]
if to_do[count][1] == 0:
temp.append(to_do[count])
#Only particles that have not been found are
# searched for again
to_do = temp
N.append(sum(n))
print 'Next row'
print 'Next slice, %i still to find' % len(to_do)
print 'Calculating sigma8...'
if not sum(N) == len(data):
return 'Error!\nN measured = {0}, total N = {1}'.format(sum(N), len(data))
else:
return 'sigma8 = %.4f, variance = %.4f, mean = %.4f' % (np.sqrt(sum((N-np.mean(N))**2)/len(N))/np.mean(N), np.var(N),np.mean(N))
I'll try to post some code, but my general idea is the following: create a Particle class that knows about the box that it lives in, which is calculated in the __init__. Each box should have a unique name, which might be the coordinate of the bottom left corner (or whatever you use to locate your boxes).
Get a new instance of the Particle class for each particle, then use a Counter (from the collections module).
Particle class looks something like:
# static consts - outside so that every instance of Particle doesn't take them along
# for the ride...
MAX_X = 150000  # note: writing 150,000 with a comma would create a tuple, not an int
X_STEP = 8000
# etc.
class Particle(object):
def __init__(self, data):
self.x = data[xvalue]
self.y = data[yvalue]
self.z = data[zvalue]
self.compute_box_label()
def compute_box_label(self):
import math
x_label = math.floor(self.x / X_STEP)
y_label = math.floor(self.y / Y_STEP)
z_label = math.floor(self.z / Z_STEP)
self.box_label = str(x_label) + '-' + str(y_label) + '-' + str(z_label)
Anyway, I imagine your sigma8 function might look like:
def sigma8(data):
import collections as col
particles = [Particle(x) for x in data]
boxes = col.Counter([x.box_label for x in particles])
counts = boxes.most_common()
#some other stuff
counts will be a list of tuples which map a box label to the number of particles in that box. (Here we're treating particles as indistinguishable.)
Using list comprehensions is much faster than using explicit loops; I think the reason is that you're relying more on the underlying C, but I'm not the person to ask. Counter is (supposedly) highly optimized as well.
Note: None of this code has been tested, so you shouldn't try the cut-and-paste-and-hope-it-works method here.
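If the particle data is already a numeric numpy array, the same box-labelling idea can be vectorized. This is an untested sketch; it assumes columns 2, 3 and 4 hold x, y and z (as check_range in the question does) and 19 boxes of side 8,000 per axis:

import numpy as np

def count_per_box(data, box_size=8000, n_per_side=19):
    # assumption: columns 2, 3, 4 hold x, y, z, matching check_range above
    coords = np.asarray(data)[:, 2:5]
    idx = (coords // box_size).astype(int)   # integer box index along each axis
    # flatten the 3-D index into a single label per particle
    labels = (idx[:, 0] * n_per_side + idx[:, 1]) * n_per_side + idx[:, 2]
    # count how many particles fall in each occupied box
    return np.unique(labels, return_counts=True)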
What's an efficient way to calculate a trimmed or winsorized standard deviation of a list?
I don't mind using numpy, but if I have to make a separate copy of the list, it's going to be quite slow.
This will make two copies, but you should give it a try because it should be very fast.
def trimmed_std(data, low, high):
tmp = np.asarray(data)
return tmp[(low <= tmp) & (tmp < high)].std()
Do you need to do rank order trimming (i.e. 5% trimmed)?
Update:
If you need percentile trimming, the best way I can think of is to sort the data first. Something like this should work:
def trimmed_std(data, percentile):
data = np.array(data)
data.sort()
percentile = percentile / 2.
low = int(percentile * len(data))
high = int((1. - percentile) * len(data))
return data[low:high].std(ddof=0)
You can obviously implement this without using numpy, but even including the time of converting the list to an array, using numpy is faster than anything I could think of.
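For example (an illustrative call, not from the original answer), trimming 50% in total keeps the middle half of the sorted data:

vals = list(range(100))
print(trimmed_std(vals, 0.5))   # standard deviation of the values 25..74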
This is what generator functions are for.
SD requires two passes, plus a count. For this reason, you'll need to "tee" some iterators over the base collection.
So.
import itertools
import math

trimmed = ( x for x in the_list if low <= x < high )
sum_iter, len_iter, var_iter = itertools.tee( trimmed, 3 )
n = sum( 1 for x in len_iter)
mean = sum( sum_iter ) / n
sd = math.sqrt( sum( (x-mean)**2 for x in var_iter ) / (n-1) )
Something like that might do what you want, though note that itertools.tee has to buffer each value internally until all three iterators have consumed it, so in the worst case it still holds the trimmed data in memory.
In order to get an unbiased trimmed mean you have to account for fractional bits of items in the list as described here and (a little less directly) here. I wrote a function to do it:
from math import modf

def percent_tmean( data, pcent ):
# make sure data is a list
dc = list( data )
# find the number of items
n = len(dc)
# sort the list
dc.sort()
# get the proportion to trim
p = pcent / 100.0
k = n*p
# print "n = %i\np = %.3f\nk = %.3f" % ( n,p,k )
# get the decimal and integer parts of k
dec_part, int_part = modf( k )
# get an index we can use
index = int(int_part)
# trim down the list
dc = dc[ index : len(dc) - index ]  # avoids the empty slice that dc[index:-index] gives when index == 0
# deal with the case of trimming fractional items
if dec_part != 0.0:
# deal with the first remaining item
dc[ 0 ] = dc[ 0 ] * (1 - dec_part)
# deal with last remaining item
dc[ -1 ] = dc[ -1 ] * (1 - dec_part)
return sum( dc ) / ( n - 2.0*k )
I also made an iPython Notebook that demonstrates it.
My function will probably be slower than those already posted but it will give unbiased results.
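As a quick sanity check (an illustrative example, not from the notebook), trimming 20% from the integers 1 to 10 drops two values from each end:

print(percent_tmean(range(1, 11), 20))   # (3+4+5+6+7+8) / (10 - 2*2) = 5.5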