Price interpolation. Python data structure for efficient near miss searches?
I have price data
[1427837961000.0, 243.586], [1427962162000.0, 245.674], [1428072262000.0, 254.372], [1428181762000.0, 253.366], ...
with the first dimension a timestamp, and the second a price.
Now I want to find the price whose timestamp is nearest to a given timestamp, e.g. 1427854534654.
What is the best Python container, data structure, or algorithm to solve this many hundreds or thousands of times per second? It is a standard problem that has to be solved in many applications, so there should be a ready, optimized solution.
I have Googled and found only bits and pieces that I could build upon, but I would guess this question is so common that the whole data structure should already be available as a module?
EDIT: Solved.
I used JuniorCompressor's solution with my bugfix for future dates.
The performance is fantastic:
3000000 calls took 12.82 seconds, so 0.00000427 per call (length of data = 1143).
Thanks a lot! Stack Overflow is great, and you helpers are the best!
The usual approach to this problem is to keep your data sorted by the timestamp value and then binary search for every query. Binary search can be performed using the bisect module:
data = [
[1427837961000.0, 243.586],
[1427962162000.0, 245.674],
[1428072262000.0, 254.372],
[1428181762000.0, 253.366]
]
data.sort(key=lambda l: l[0]) # Sort by timestamp
timestamps = [l[0] for l in data] # Extract timestamps
import bisect
def find_closest(t):
    idx = bisect.bisect_left(timestamps, t)  # Find insertion point
    if idx == len(timestamps):               # t is later than every timestamp
        return data[-1][1]
    # Check whether the timestamp at idx or at idx - 1 is closer
    if idx > 0 and abs(timestamps[idx] - t) > abs(timestamps[idx - 1] - t):
        idx -= 1
    return data[idx][1]  # Return price
We can test like this:
>>> find_closest(1427854534654)
243.586
If we have n queries and m timestamp values, then each query needs O(log m) time. So the total time needed is O(n * log m).
In the above algorithm we have to compare two neighbouring indexes. If we instead search over the midpoints of the timestamp intervals, we can simplify even more and create a slightly faster search:
midpoints = [(a + b) / 2 for a, b in zip(timestamps, timestamps[1:])]
def find_closest_through_midpoints(t):
return data[bisect.bisect_left(midpoints, t)][1]
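For example, on the sample data above (output shown for illustration):
>>> find_closest_through_midpoints(1427854534654)
243.586
This variant also handles timestamps beyond the last data point without an extra check, since bisect_left on midpoints can return at most len(midpoints) == len(data) - 1.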
Try this to get the nearest value (note it scans the whole list, so each lookup is O(n)):
l = [ [1427837961000.0, 243.586], [1427962162000.0, 245.674], [1428072262000.0, 254.372], [1428181762000.0, 253.366]]
check_value = 1427854534654
>>> min(l, key=lambda x: abs(x[0] - check_value))[0]
1427837961000.0
Use [1] instead of [0] at the end if you want the price rather than the timestamp.
I was looking at the code for a particular cryptocurrency casino game (EthCrash - if you're interested). The game generates crash points using a function (I call this crash(x)) where x is an integer that is randomly drawn from the space of integers (0,2^52).
I'd like to calculate the expected value of the crash points. The code below should explain everything, but a clean picture of the function is here: https://i.imgur.com/8dPBALa.png, and what I'm trying to calculate is here: https://i.imgur.com/nllykDQ.png (apologies - can't paste pictures yet).
I wrote the following code:
import math
two52 = 2**52
def crash(x):
crash_point = math.floor((100*two52-x)/(two52-x))
return(crash_point/100)
crashes_sum = 0
for i in range(two52+1):
crashes_sum += crash(i)
expected_crash = crashes_sum/two52
Unfortunately, the loop is taking too long to run - any ideas for how I can do this faster?
OK, if you cannot do it the straightforward way, it is time to get smart, right?
So the idea is to find ranges over which the whole sum can be computed quickly. I will put down some pseudocode which does not even compile and could have bugs; use it as an illustration.
First, let's rewrite the term in the sum as
floor(100 + 99*x/(2**52 - x))
First idea: find the ranges where the floor does not change, i.e. where
n <= 99*x/(2**52 - x) < n + 1. Obviously, over such a range we can add range_length*(100 + n) to the sum; no need to do it term by term.
sum = 0
r_lo = 0
for n in range(1, 2**52):              # LOOP OVER RANGES
    r_hi = floor(2**52/(1 + 99/n))     # last x before the floor value reaches n
    sum += (100 + n - 1)*(r_hi - r_lo)
    if r_hi - r_lo == 1:
        break
    r_lo = r_hi + 1
Obviously, the range size will shrink until it equals 1, at which point this method becomes useless and we break out. By that time each term differs from the previous one by 1 or more.
OK, second idea: again ranges, but now ranges where the sum is an arithmetic series. First we have to find the range where the increment between consecutive terms equals 1, then the range where the increment equals 2, etc. It looks like you have to find roots of a quadratic equation for this, but the code would be about the same:
r_lo = pos_for_increment(1)
t_lo = ...                              # term at r_lo
for n in range(2, 2**52):               # LOOP OVER RANGES
    r_hi = pos_for_increment(n) - 1
    t_hi = ...                          # term at r_hi
    sum += (t_lo + t_hi)*(r_hi - r_lo) / 2   # arithmetic series sum
    if r_hi > 2**52:
        break
    r_lo = r_hi + 1
    t_lo = t_hi + n
I might think of something else later, but those tricks are worth trying.
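For what it's worth, here is a small runnable sketch of the first idea (my own code, not the pseudocode above made literal): it sums over runs where the floor value is constant and falls back to term-by-term summation once the runs shrink to single elements. It is checked against brute force only for a reduced N = 2**20; for N = 2**52 the same loop structure applies, but the run and tail counts are on the order of sqrt(99*N), roughly 6.7e8, so it is still a sizeable (though feasible) computation.

def range_sum(N):
    # Sum of floor((100*N - x)/(N - x)) for x in [0, N-1].
    # floor((100*N - x)/(N - x)) == 100 + floor(99*x/(N - x)); write g(x) = floor(99*x/(N - x)).
    # g(x) >= v  <=>  x >= ceil(v*N/(99 + v)), so all x with g(x) == v form one contiguous run.
    total = 0
    v = 0
    lo = 0                                              # first x with g(x) >= v
    while lo < N:
        hi = min(((v + 1) * N + 99 + v) // (100 + v), N)   # first x with g(x) >= v + 1 (ceil division)
        if hi - lo <= 1:
            break                                       # runs have shrunk to single terms; finish directly
        total += (100 + v) * (hi - lo)
        v += 1
        lo = hi
    for x in range(lo, N):                              # tail: term by term
        total += 100 + (99 * x) // (N - x)
    return total

def brute_sum(N):
    return sum((100 * N - x) // (N - x) for x in range(N))

N = 2 ** 20
assert range_sum(N) == brute_sum(N)
print(range_sum(N) / float(N) / 100)                    # expected crash point for this reduced N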
Using the map function might help increase the speed slightly by moving the loop into C, though note that map itself does not run the computation in parallel:
import math
two52 = 2**52
def crash(x):
crash_point = math.floor((100*two52-x)/(two52-x))
return(crash_point/100)
crashes_sum = sum(map(crash,range(two52)))
expected_crash = crashes_sum/two52
I have been able to speed up your code by taking advantage of numpy vectorization:
import numpy as np
import time
two52 = 2**52
crash = lambda x: np.floor( ( 100 * two52 - x ) / ( two52 - x ) ) / 100
starttime = time.time()
icur = 0
ispan = 100000
crashes_sum = 0
while icur < two52-ispan:
i = np.arange(icur, icur+ispan, 1)
crashes_sum += np.sum(crash(i))
icur += ispan
crashes_sum += np.sum(crash(np.arange(icur, two52, 1)))
expected_crash = crashes_sum / two52
print(time.time() - starttime)
The trick is to compute the sum over a moving window to take advantage of numpy's vectorization (written in C). I tried up to 2**30: it takes 9 seconds on my laptop (your original code took too long to benchmark).
Python is probably not the most suitable language for what you want to do; you may want to try C or Fortran for that (and take advantage of threading).
You will have to use a powerful GPU if you want the result within a few hours.
A possible CPU implementation
import numpy as np
import numba as nb
import time
two52 = 2**52
loop_to=2**30
@nb.njit(fastmath=True, parallel=True)
def sum_over_crash(two52,loop_to): #loop_to is only for testing performance
crashes_sum = nb.float64(0)
for i in nb.prange(loop_to):#nb.prange(two52+1):
crashes_sum += np.floor((100*two52-i)/(two52-i))/100
return crashes_sum/two52
sum_over_crash(two52,2)#don't measure static compilation overhead
t1=time.time()
sum_over_crash(two52,2**30)
print(time.time()-t1)
This takes 0.57 s on my quad-core i7, i.e. about 28 days for the whole calculation.
As the calculation cannot be simplified mathematically, the only option is to calculate it step by step.
This takes a long time (as stated in other answers). Your best bet for calculating it quickly is to use a lower-level language than Python. Since Python is an interpreted language, it is rather slow for this kind of computation.
Additionally, you can use multithreading (if available in the chosen language) to make it even faster.
Cloud computing is also an option that could be suitable, as you are only going to calculate the number once. Amazon and Google (and many more) provide this kind of service for a relatively small fee.
But before performing any of the calculations you need to adjust your formula: as it stands right now, you're going to get a ZeroDivisionError at the very last iteration of your loop.
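For instance, one minimal adjustment (my own sketch, assuming x is meant to be drawn from [0, 2**52), as the question's interval notation suggests) is simply to stop the loop one step earlier so the denominator never reaches zero:

import math

two52 = 2**52

def crash(x):
    return math.floor((100*two52 - x)/(two52 - x))/100

crashes_sum = 0
for i in range(two52):        # range(two52), not range(two52 + 1): x == two52 would divide by zero
    crashes_sum += crash(i)
expected_crash = crashes_sum/two52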
Using: Python 2.7.1 on Windows
Hello, I fear this question has a very simple answer, but I just can't seem to find an appropriate and efficient solution (I have limited Python experience). I am writing an application that downloads historic weather data from a third-party API (Wunderground). The thing is, sometimes there's no value for a given hour (e.g., we have 20 degrees at 5 AM, no value for 6 AM, and 21 degrees at 7 AM). I need exactly one temperature value for every hour, so I figured I could fit the data I do have and evaluate the points I'm missing (using SciPy's polyfit). That's all fine; however, I am having problems getting my program to detect whether the list has missing hours and, if so, to insert the missing hour and calculate a temperature value. I hope that makes sense.
My attempt at handling the hours and temperatures list is the following:
from scipy import polyfit
# Evaluate simple quadratic function
def tempcal (array,x):
return array[0]*x**2 + array[1]*x + array[2]
# Sample data, note it has missing hours.
# My final hrs list should look like range(25), with matching temperatures at every point
hrs = [1,2,3,6,9,11,13,14,15,18,19,20]
temps = [14.0,14.5,14.5,15.4,17.8,21.3,23.5,24.5,25.5,23.4,21.3,19.8]
# Fit coefficients
coefs = polyfit(hrs,temps,2)
# Cycle control
i = 0
done = False
while not done:
# It has missing hour, insert it and calculate a temperature
if hrs[i] != i:
hrs.insert(i,i)
temps.insert(i,tempcal(coefs,i))
# We are done, leave now
if i == 24:
done = True
i += 1
I can see why this isn't working, the program will eventually try to access indexes out of range for the hrs list. I am also aware that modifying list's length inside a loop has to be done carefully. Surely enough I am either not being careful enough or just overlooking a simpler solution altogether.
In my googling attempts to help myself I came across pandas (the library) but I feel like I can solve this problem without it, (and I would rather do so).
Any input is greatly appreciated. Thanks a lot.
When i equals 21, it refers to the twenty-second value in the list, but there are only 21 values.
In the future I recommend using PyCharm with breakpoints for debugging, or a try/except construction.
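A small sketch of the try/except idea (my own illustration, assuming the setup from the question — hrs, temps, coefs, tempcal, i, done — has already run): wrapping the loop makes the failing index visible immediately.

try:
    while not done:
        if hrs[i] != i:
            hrs.insert(i, i)
            temps.insert(i, tempcal(coefs, i))
        if i == 24:
            done = True
        i += 1
except IndexError:
    print("IndexError at i = %d, len(hrs) = %d" % (i, len(hrs)))   # prints i = 21, len(hrs) = 21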
I'm not sure I would recommend this way of interpolating values; I would have used the closest points surrounding the missing values rather than a fit to the whole dataset (a sketch of that alternative follows the code below). But using numpy, your proposed approach is fairly straightforward:
import numpy as np

hrs = np.array(hrs)
temps = np.array(temps)
newTemps = np.empty((25))
newTemps.fill(-300) #just fill it with some invalid data, temperatures don't go this low so it should be safe.
#fill in original values
newTemps[hrs - 1] = temps
#Get indicies of missing values
missing = np.nonzero(newTemps == -300)[0]
#Calculate and insert missing values.
newTemps[missing] = tempcal(coefs, missing + 1)
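For comparison, here is a sketch of the "closest surrounding points" alternative mentioned above (my own code, assuming as in the answer that the hours run 1 through 25): np.interp interpolates linearly between the two known neighbours of each missing hour and clamps to the endpoint values outside the known range.

import numpy as np

hrs = [1, 2, 3, 6, 9, 11, 13, 14, 15, 18, 19, 20]
temps = [14.0, 14.5, 14.5, 15.4, 17.8, 21.3, 23.5, 24.5, 25.5, 23.4, 21.3, 19.8]

all_hrs = np.arange(1, 26)                     # hours 1..25
all_temps = np.interp(all_hrs, hrs, temps)     # linear interpolation; hours 21..25 get the hour-20 value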
I implemented Red-Black trees in Python according to the pseudocode in Cormen's Introduction to Algorithms.
I wanted to see with my own eyes that my insert is really O(log n), so I plotted the time it takes to insert n = 1, 10, 20, ..., 5000 nodes into the tree.
This is the result:
the x-axis is n and the y-axis is the time it took in milliseconds.
To me the graph looks more linear than logarithmic. What can explain that?
Ok, so the graph displays a measurement of the cost of inserting n elements into your tree, where the x axis is how many elements we've inserted, and the y axis is the total time.
Let's call the function that totals the time it takes to insert n elements into the tree f(n).
Then we can get a rough idea of what f might look like:
f(1) < k*log(1) for some constant k.
f(2) < k*log(1) + k*log(2) for some constant k
...
f(n) < k * [log(1) + log(2) + ... + log(n)] for some constant k.
Due to how logs work, we can collapse log(1) + ... + log(n):
f(n) < k * [log(1*2*3*...*n)] for some constant k
f(n) < k * log(n!) for some constant k
We can take a look at Wikipedia to see a graph of what log(n!) looks like. Take a look at the graph in the article. Should look pretty familiar to you. :)
That is, I think you've done this by accident:
for n in (5000, 50000, 500000):
startTime = ...
## .. make a fresh tree
## insert n elements into the tree
stopTime = ...
## record the tuple (n, stopTime - startTime) for plotting
and plotted total time to construct the tree of size n, rather than the individual cost of inserting one element into a tree of size n:
for n in range(50000):
startTime = ...
## insert an element into the tree
stopTime = ...
## record the tuple (n, stopTime - startTime) for plotting
Chris Taylor notes in the comments that if you plot f(n)/n, you'll see a log graph. That's because a fairly tight approximation to log(n!) is n*log(n) (see the Wikipedia page). So we can go back to our bound:
f(n) < k * log(n!) for some constant k
and get:
f(n) < k * n * log(n) for some constant k
And now it should be easier to see that if you divide f(n) by n, your graph will be bounded above by the shape of a logarithm.
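As a quick numeric illustration of this (my own addition), math.lgamma(n + 1) gives the natural log of n!, and you can see it track n*log(n) closely:

import math

for n in (10, 100, 1000, 5000):
    # lgamma(n + 1) == ln(n!); compare against n*ln(n)
    print(n, math.lgamma(n + 1), n * math.log(n))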
5000 might not be large enough to really "see" the logarithm -- try runs at 50000 and 500000. If it takes two seconds and twenty seconds, then linear growth makes sense. If it takes less, then logarithmic makes sense. If you zoom in closely enough on most "simple" functions, the results look pretty linear.
There are always a handful of speculations for any 'why' question. I would suspect the jumps you are seeing are related to system memory management: if the system has to allocate a larger memory space for continued growth, that adds a certain amount of time to the processing of the whole program. If you added a 'payload' field to your nodes, thus increasing the amount of storage space needed, and I am correct, the jumps will happen more often.
Nice graph, by the way.
I have a matrix which is fairly large (around 50K rows), and I want to print the correlation coefficient between each row in the matrix. I have written Python code like this:
for i in xrange(rows): # rows is the number of rows in the matrix.
for j in xrange(i, rows):
r = scipy.stats.pearsonr(data[i,:], data[j,:])
print r
Please note that I am making use of the pearsonr function available from the scipy module (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).
My question is: Is there a quicker way of doing this? Is there some matrix partition technique that I can use?
Thanks!
New Solution
After looking at Joe Kington's answer, I decided to look into the corrcoef() code and was inspired by it to do the following implementation.
import numpy as np
import scipy.stats

ms = data.mean(axis=1)[(slice(None, None, None), None)]   # row means, kept as a column vector
datam = data - ms                                          # center each row
datass = np.sqrt(scipy.stats.ss(datam, axis=1))            # per-row root sum of squares
for i in xrange(rows):
    temp = np.dot(datam[i:], datam[i].T)                   # dot products of row i with rows i..end
    rs = temp / (datass[i:] * datass[i])                   # Pearson r of row i vs rows i..end
Each loop through generates the Pearson coefficients between row i and rows i through to the last row. It is very fast. It is at least 1.5x as fast as using corrcoef() alone because it doesn't redundantly calculate the coefficients and a few other things. It will also be faster and won't give you the memory problems with a 50,000 row matrix because then you can choose to either store each set of r's or process them before generating another set. Without storing any of the r's long term, I was able to get the above code to run on 50,000 x 10 set of randomly generated data in under a minute on my fairly new laptop.
Old Solution
First, I wouldn't recommend printing the r's to the screen. For 100 rows (10 columns), your code takes 19.79 seconds with printing vs. 0.301 seconds without. Just store the r's and use them later if you like, or do some processing on them as you go, like looking for some of the largest r's.
Second, you can get some savings by not redundantly calculating some quantities. The Pearson coefficient is calculated in scipy using quantities that you can precalculate rather than calculating every time a row is used. Also, you aren't using the p-value (which is also returned by pearsonr()), so let's scratch that too. Using the code below:
r = np.zeros((rows,rows))
ms = data.mean(axis=1)
datam = np.zeros_like(data)
for i in xrange(rows):
datam[i] = data[i] - ms[i]
datass = scipy.stats.ss(datam,axis=1)
for i in xrange(rows):
for j in xrange(i,rows):
r_num = np.add.reduce(datam[i]*datam[j])
r_den = np.sqrt(datass[i]*datass[j])
r[i,j] = min((r_num / r_den), 1.0)
I get a speed-up of about 4.8x over the straight scipy code when I've removed the p-value stuff - 8.8x if I leave the p-value stuff in there (I used 10 columns with hundreds of rows). I also checked that it does give the same results. This isn't a really huge improvement, but it might help.
Ultimately, you are stuck with the problem that you are computing (50000)*(50001)/2 = 1,250,025,000 Pearson coefficients (if I'm counting correctly). That's a lot. By the way, there's really no need to compute each row's Pearson coefficient with itself (it will equal 1), but that only saves you from computing 50,000 Pearson coefficients. With the above code, I expect that it would take about 4 1/4 hours to do your computation if you have 10 columns to your data based on my results on smaller datasets.
You can get some improvement by taking the above code into Cython or something similar. I expect that you'll maybe get up to a 10x improvement over straight Scipy if you're lucky. Also, as suggested by pyInTheSky, you can do some multiprocessing.
Have you tried just using numpy.corrcoef? Seeing as how you're not using the p-values, it should do exactly what you want, with as little fuss as possible. (Unless I'm mis-remembering exactly what pearson's R is, which is quite possible.)
Just quickly checking the results on random data, it returns exactly the same thing as @Justin Peel's code above and runs ~100x faster.
For example, testing things with 1000 rows and 10 columns of random data...:
import numpy as np
import scipy as sp
import scipy.stats
def main():
data = np.random.random((1000, 10))
x = corrcoef_test(data)
y = justin_peel_test(data)
print 'Maximum difference between the two results:', np.abs((x-y)).max()
return data
def corrcoef_test(data):
"""Just using numpy's built-in function"""
return np.corrcoef(data)
def justin_peel_test(data):
"""Justin Peel's suggestion above"""
rows = data.shape[0]
r = np.zeros((rows,rows))
ms = data.mean(axis=1)
datam = np.zeros_like(data)
for i in xrange(rows):
datam[i] = data[i] - ms[i]
datass = sp.stats.ss(datam,axis=1)
for i in xrange(rows):
for j in xrange(i,rows):
r_num = np.add.reduce(datam[i]*datam[j])
r_den = np.sqrt(datass[i]*datass[j])
r[i,j] = min((r_num / r_den), 1.0)
r[j,i] = r[i,j]
return r
data = main()
Yields a maximum absolute difference of ~3.3e-16 between the two results
And timings:
In [44]: %timeit corrcoef_test(data)
10 loops, best of 3: 71.7 ms per loop
In [45]: %timeit justin_peel_test(data)
1 loops, best of 3: 6.5 s per loop
numpy.corrcoef should do just what you want, and it's a lot faster.
You can use the Python multiprocessing module: chunk up your rows into 10 sets, buffer your results, and then print the stuff out (this will only speed it up on a multicore machine, though).
http://docs.python.org/library/multiprocessing.html
By the way, you'd also have to turn your snippet into a function and consider how to reassemble the data. Having each subprocess work on a list like [startcoord, stopcoord, buff] might work nicely.
def myfunc(thelist):
    for i in xrange(thelist[0], thelist[1]):
        ....
    thelist[2] = result
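Here is a rough sketch of that chunking idea (my own illustration, with made-up sizes): each worker computes the correlations of a block of rows against all rows with numpy.corrcoef, and the parent reassembles the blocks. Note that on platforms that spawn rather than fork worker processes, data would have to be loaded or passed explicitly instead of relying on the module-level variable.

import numpy as np
from multiprocessing import Pool

data = np.random.random((5000, 10))            # stand-in for the real 50K-row matrix

def corr_block(bounds):
    start, stop = bounds
    k = stop - start
    # corrcoef of the chunk stacked on top of the full matrix;
    # the top-right block is "chunk rows vs. all rows"
    return start, stop, np.corrcoef(data[start:stop], data)[:k, k:]

if __name__ == '__main__':
    rows = data.shape[0]
    chunks = [(i, min(i + 1000, rows)) for i in range(0, rows, 1000)]
    pool = Pool()
    blocks = pool.map(corr_block, chunks)
    pool.close()
    pool.join()
    r = np.vstack([block for _start, _stop, block in blocks])   # full rows x rows correlation matrix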
A user will specify a time interval of n secs/mins/hours and then two times (start/stop).
I need to be able to take this interval, and then step through the start and stop times, in order to get a list of these times. Then after this, I will perform a database look up via a table.objects.filter, in order to retrieve the data corresponding to each time.
I'm making some ridiculously long algorithms at the moment and I'm positive there could be an easier way to do this. That is, a more pythonic way. Thoughts?
it fits nicely as a generator, too:
def timeseq(start,stop,interval):
while start <= stop:
yield start
start += interval
used as:
for t in timeseq(start,stop,interval):
table.objects.filter(t)
or:
data = [table.objects.filter(t) for t in timeseq(start,stop,interval)]
Are you looking for something like this? (pseudocode)
t = start
while t <= stop:
    table.objects.filter(t)
    t += interval
What about ...
result = RelevantModel.objects.filter(relavant_field__in=[
start + interval * i
for i in xrange((end - start).seconds / interval.seconds)
])
... ?
I can't imagine this is very different from what you're already doing, but perhaps it's more compact (particularly if you weren't using foo__in=[bar] or a list comprehension). Of course start and end would be datetime.datetime objects and interval would be a datetime.timedelta object.
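For example (with hypothetical values), the comprehension above would iterate over times like these; note that as written it excludes end itself:

from datetime import datetime, timedelta

start = datetime(2015, 4, 1, 12, 0)
end = datetime(2015, 4, 1, 13, 0)
interval = timedelta(minutes=10)

times = [start + interval * i
         for i in range((end - start).seconds // interval.seconds)]
# 12:00, 12:10, 12:20, 12:30, 12:40, 12:50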