I know that you can pass the data lists x and y to scipy's interp1d by reference. Does this mean I can add new data to it by simply modifying the inputs x and y in-place?
Ideally, I'm looking for something that will do the following efficiently:
1. It interpolates the value for some point we request.
2. Based on that interpolated value, we decide whether or not to obtain the 'true' value.
3. If the true value is obtained, we update the algorithm's knowledge for future interpolations.
However, once values are put into the interpolating algorithm, they are assumed never to change; new points can only be added. I think interp1d does some kind of fancy preprocessing on the input data to make lookups faster, but I'm not sure whether that precludes adding to the data in-place. Please help!
Edit: Some of you will likely notice that this has a lot in common with Metropolis-Hastings; however, steps 1-3 may not occur serially, hence I need a more abstract interpolation method that supports asynchronous updates. If you know of any suggestions, that would be great!
I think the simplest is to write your own interpolating object:
class Interpolator:
    def __init__(self, x, y):
        if len(x) != len(y):
            raise ValueError("Lists must have the same length")
        self.xlist = x
        self.ylist = y
        self.len = len(x)

    def find_x_index(self, x0):
        # binary search: find index i such that xlist[i] <= x0 < xlist[i+1]
        a, b = 0, self.len - 1
        while b - a > 1:
            m = (a + b) // 2
            if x0 < self.xlist[m]:
                b = m
            else:
                a = m
        return a

    def add_point(self, x, y):
        # insert a new point, keeping xlist sorted
        if x < self.xlist[0]:
            self.xlist.insert(0, x)
            self.ylist.insert(0, y)
        elif x > self.xlist[-1]:
            self.xlist.append(x)
            self.ylist.append(y)
        else:
            i = self.find_x_index(x)
            self.xlist.insert(i + 1, x)
            self.ylist.insert(i + 1, y)
        self.len += 1

    def interpolate(self, x0):
        # linear interpolation of the y value at x0
        if x0 < self.xlist[0] or x0 > self.xlist[-1]:
            raise ValueError("Value out of range")
        a = self.find_x_index(x0)
        eps = (x0 - self.xlist[a]) / (self.xlist[a + 1] - self.xlist[a])
        return eps * self.ylist[a + 1] + (1 - eps) * self.ylist[a]
itp=Interpolator([1,2,3],[1,3,4])
print(itp.interpolate(1.6))
itp.add_point(1.5,3)
print(itp.interpolate(1.6))
The key point is to always keep the x list sorted, so that lookups can use dichotomy (binary search), which runs in logarithmic time.
Remark: in add_point, you should check that the same x value isn't inserted twice with different y values, since that would make interpolate divide by zero.
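Incidentally, the standard library's bisect module already implements this binary search (and bisect.insort can keep a list sorted on insertion, though here x and y would need to be inserted in tandem); a minimal sketch of the same lookup:

import bisect

xlist = [1, 2, 3]
x0 = 1.6
# index i such that xlist[i] <= x0 < xlist[i+1], for x0 inside the range
i = bisect.bisect_right(xlist, x0) - 1
print(i)  # -> 0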
So, continuing from the discussion @TheBlackCat and I were having in this answer, I would like to know the best way to pass arguments to a Numpy vectorized function. The function in question is defined thus:
vect_dist_funct = np.vectorize(lambda p1, p2: vincenty(p1, p2).meters)
where, vincenty comes from the Geopy package.
I currently call vect_dist_funct in this manner:
def pointer(point, centroid, tree_idx):
    intersect = list(tree_idx.intersection(point))
    if len(intersect) > 0:
        points = pd.Series([point] * len(intersect)).values
        polygons = centroid.loc[intersect].values
        dist = vect_dist_funct(points, polygons)
        return pd.Series(dist, index=intersect, name='Dist').sort_values()
    else:
        return pd.Series(np.nan, index=[0], name='Dist')
points['geometry'].apply(lambda x: pointer(point=x.coords[0], centroid=line['centroid'], tree_idx=tree_idx))
(Please refer to the question here: Labelled datatypes Python)
My question pertains to what happens inside the function pointer. The reason I am converting points to a pandas.Series and then getting the values (in the 4th line, just under the if statement) is to make it in the same shape as polygons. If I merely call points either as points = [point]*len(intersect) or as points = itertools.repeat(point, len(intersect)), Numpy complains that it "cannot broadcast arrays of size (n,2) and size (n,) together" (n is the length of intersect).
If I call vect_dist_funct like so: dist = vect_dist_funct(itertools.repeat(points, len(intersect)), polygons), vincenty complains that I have passed it too many arguments. I am at a complete loss to understand the difference between the two.
Note that these are coordinates, and therefore will always come in pairs. Here are examples of what point and polygons look like:
point = (-104.950752, 39.854744) # Passed directly to the function like this.
polygons = array([(-104.21750802451864, 37.84052458697633),
(-105.01017084789603, 39.82012158954065),
(-105.03965315742742, 40.669867471420886),
(-104.90353460825702, 39.837631505433706),
(-104.8650601872832, 39.870796282334744)], dtype=object)
# As returned by statement centroid.loc[intersect].values
What is the best way to call vect_dist_funct in this circumstance, such that I can have a vectorized call, and neither Numpy nor vincenty will complain about wrong arguments? Also, techniques that minimize memory consumption and increase speed are sought. The goal is to compute the distance between the point and each polygon centroid.
np.vectorize doesn't really help you here. As per the documentation:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
In fact, vectorize actively hurts you, since it converts the inputs into numpy arrays, doing an unnecessary and expensive type conversion and producing the errors you are seeing. You are much better off using a function with a for loop.
It is also better to use a named function rather than a lambda for a top-level function, since it lets you have a docstring.
So this is how I would implement what you are doing:
def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, calling `vincenty` with `p1` as the first
    argument and the current element of `p2` as the second. Returns
    a list where each value is the result of the `vincenty` call
    for the corresponding element of `p2`.
    """
    return [vincenty(p1, p2i).meters for p2i in p2]
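Used inside pointer from the question, this also removes the pd.Series([point]*len(intersect)).values dance, since the point is passed once (a sketch against the question's names, inside the if branch):

dist = vect_dist_funct(point, polygons)
return pd.Series(dist, index=intersect, name='Dist').sort_values()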
If you really want to use vectorize, you can use the excluded argument to not vectorize the p1 argument, or better yet set up a lambda that wraps vincenty and only vectorizes the second argument:
def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, calling `vincenty` with `p1` as the first
    argument and the current element of `p2` as the second. Returns
    a numpy array where each value is the result of the `vincenty`
    call for the corresponding element of `p2`.
    """
    vinc_p = lambda x: vincenty(p1, x).meters
    return np.vectorize(vinc_p)(p2)
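For completeness, the excluded route mentioned above would look something like the following (a sketch; np.vectorize still loops in Python under the hood, so this buys convenience, not speed):

vinc_m = lambda p1, p2: vincenty(p1, p2).meters
# exclude the first positional argument from vectorization
vect_dist_funct = np.vectorize(vinc_m, excluded={0})
dist = vect_dist_funct(point, polygons)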
I am having trouble creating a function with two variables and three parameters. I want to perform a definite (numerical) integral over one of the variables (say t), and have it spit out an array F1(x;a,b,c), i.e. an array with a value associated with each entry in x, with scalar parameters a, b, and c. Ultimately I will need to fit the parameters (a,b,c) to data using leastsq, which I have done before using simpler functions.
Code looks like this:
def H1(t, x, a, b, c):  # integrand
    return ...  # (a function of the above, with parameters a, b, c,
                # dummy variable t to be integrated from 0 to inf, and x)

def F1(x, a, b, c):  # integrates H1 over 0 < t < inf
    # x is going to be an element of the array x_data
    f_int1 = integrate.quad(H1, 0., np.inf, args=(x, a, b, c))
    return f_int1
Now, for example if I try to use F1 as a function:
F1(x_data,70.,.05,.1) #where x_data is an array of real numbers, between 0 and 500
I get the message:
quadpack.error: Supplied function does not return a valid float
I am hoping it will spit out an array: F1 for all the entries in x_data. If I just use a single scalar value for the first input into F1, e.g.:
F1(x_data[4],70.,.05,.1)
It spits out two numbers: the value of F1 at that point and the error estimate. This is part of what I want, but it only works when passing a single scalar value; I need it to accept an array (and therefore produce an array).
I am guessing the problem lies in passing an array through the function as an argument, though I am not sure of a better way to do this. I think I have to keep it as a function, since I will be using leastsq in the next few lines of code. (I know how to use leastsq, I think!)
Anyone have any ideas on how to get around this?
scipy.integrate.quad does not accept array-valued functions. Your best bet is to have a loop over the components (possibly with syntactic sugar of numpy.vectorize).
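A minimal sketch of that loop, assuming H1 and x_data are as in the question (quad returns a (value, error) tuple, so only the first element is kept):

import numpy as np
from scipy import integrate

def F1(x, a, b, c):
    # integrate H1 over t for a single scalar x; discard the error estimate
    val, _ = integrate.quad(H1, 0., np.inf, args=(x, a, b, c))
    return val

# loop explicitly over the array...
F1_arr = np.array([F1(x, 70., .05, .1) for x in x_data])

# ...or wrap the scalar version with np.vectorize for convenience
F1_arr = np.vectorize(F1)(x_data, 70., .05, .1)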
I have an N-dimensional numpy array S. Every iteration, exactly one value in this array will change.
I have a second array, G, that stores the gradient of S, as calculated by numpy's gradient() function. Currently, my code recalculates all of G every time I update S, but this is unnecessary: only one value in S has changed, so I should only have to recalculate 1+d*2 values in G, where d is the number of dimensions of S.
This would be an easier problem to solve if I knew the dimensionality of the arrays, but the solutions I have come up with in the absence of this knowledge have been quite inefficient (not substantially better than just recalculating all of G).
Is there an efficient way to recalculate only the necessary values in G?
Edit: adding my attempt, as requested
The function returns a vector indicating the gradient of S at coords in each dimension. It calculates this without calculating the gradient of S at every point, but the problem is that it does not seem to be very efficient.
It looks similar in some ways to the answers already posted, but maybe there is something quite inefficient about it?
The idea is the following: I iterate through each dimension, creating a slice that is a vector only in that dimension. For each of these slices, I calculate the gradient and place the appropriate value from that gradient into the correct place in the returned vector grad.
The use of min() and max() is to deal with the boundary conditions.
def getSGradAt(self, coords):
    """Returns the gradient of S at the position specified by
    the vector argument 'coords'.
    self.nDim : the number of dimensions of S
    self.nBins : the width of S (same in every dim)
    self.s : S """
    grad = zeros(self.nDim)
    for d in xrange(self.nDim):
        # create a slice through S that has size > 1 only in the current
        # dimension, d; min()/max() clamp the slice at the boundaries
        slices = list(coords)
        slices[d] = slice(max(0, coords[d] - 1), min(self.nBins, coords[d] + 2))
        # take the middle value from the gradient vector
        grad[d] = gradient(self.s[tuple(slices)])[1]
    return grad
The problem is that this doesn't run very quickly. In fact, just taking the gradient of the whole array S seems to run faster (for nBins = 25 and nDim = 4).
Edited again, to add my final solution
Here is what I ended up using. This function updates S, changing the value at X by the amount change. It then updates G using a variation on the technique proposed by Jaime.
def changeSField(self, X, change):
    # change S
    self.s[X] += change
    # update G (gradient field)
    slices = tuple(slice(None if j - 2 <= 0 else j - 2, j + 3, 1) for j in X)
    newGrads = gradient(self.s[slices])
    for i in arange(self.nDim):
        self.g[i][slices] = newGrads[i]
Your question is much too open for you to get a good answer: it is always a good idea to post your inefficient code, so that potential answerers can better help you. Anyway, let's say you know the coordinates of the point that has changed, and that you store them in a tuple named coords. First, let's construct a tuple of slices encompassing your point:
slices = tuple(slice(None if j-1 <= 0 else j-1, j+2, 1) for j in coords)
You may want to extend the limits to j-2 and j+3 so that the gradient is calculated using central differences whenever possible, but it will be slower.
You can now update your array by doing something like:
G[slices] = np.gradient(N[slices])
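One caveat: for a d-dimensional array, np.gradient returns a list of d arrays (one per axis), so if G is stored that way the assignment needs a short loop. A sketch for an assumed 2-D S (the values at the edges of the slice come from one-sided differences, which is why widening to j-2 and j+3 gives better accuracy):

import numpy as np

S = np.random.rand(25, 25)
G = np.gradient(S)  # list of two arrays, one gradient component per axis
coords = (10, 12)

S[coords] += 0.5  # the single changed value
slices = tuple(slice(None if j - 1 <= 0 else j - 1, j + 2, 1) for j in coords)
newGrads = np.gradient(S[slices])
for i in range(S.ndim):
    G[i][slices] = newGrads[i]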
Hmm, I could help better if I had an example, but what about just creating a secondary array, S2 (by the way, I'd choose longer and more meaningful names for your variables), recalculating its gradient, G2, and then introducing it back into G?
Another question: if you don't know the dimensionality of S, how are you changing the particular element that changes? Are you just recalculating the whole of S?
I suggest you clarify these things so that people can help you better.
Cheers!
I have a dynamic programming algorithm (modified Needleman-Wunsch) which requires the same basic calculation twice, but the calculation is done in the orthogonal direction the second time. For instance, from a given cell (i,j) in matrix scoreMatrix, I want to both calculate a value from values "up" from (i,j), as well as a value from values to the "left" of (i,j). In order to reuse the code I have used a function in which in the first case I send in parameters i,j,scoreMatrix, and in the next case I send in j,i,scoreMatrix.transpose(). Here is a highly simplified version of that code:
def calculateGapCost(i, j, scoreMatrix, gapcost):
    return scoreMatrix[i-1, j] - gapcost
...
gapLeft = calculateGapCost(i, j, scoreMatrix, gapcost)
gapUp = calculateGapCost(j, i, scoreMatrix.transpose(), gapcost)
...
I realized that I could alternatively send in a function that would in the one case pass through arguments (i,j) when retrieving a value from scoreMatrix, and in the other case reverse them to (j,i), rather than transposing the matrix each time.
def passThrough(i, j, matrix):
    return matrix[i, j]

def flipIndices(i, j, matrix):
    return matrix[j, i]

def calculateGapCost(i, j, scoreMatrix, gapcost, retrieveValue):
    return retrieveValue(i-1, j, scoreMatrix) - gapcost
...
gapLeft = calculateGapCost(i, j, scoreMatrix, gapcost, passThrough)
gapUp = calculateGapCost(j, i, scoreMatrix, gapcost, flipIndices)
...
However, if numpy's transpose uses some features I'm unaware of to do the transpose in just a few operations, it may be that transpose is in fact faster than my pass-through function idea. Can anyone tell me which would be faster (or if there is a better method I haven't thought of)?
The actual method would call retrieveValue 3 times, and involves 2 matrices that would be referenced (and thus transposed if using that approach).
In NumPy, transpose returns a view with a different shape and strides. It does not touch the data.
Therefore, you will likely find that the two approaches have identical performance, since in essence they are exactly the same.
However, the only way to be sure is to benchmark both.
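A quick way to convince yourself of this (np.shares_memory needs a reasonably recent NumPy; on older versions, checking b.base is a is an alternative):

import numpy as np

a = np.arange(6).reshape(2, 3)
b = a.transpose()
print(a.strides, b.strides)    # e.g. (24, 8) vs (8, 24): strides swapped
print(np.shares_memory(a, b))  # True: the data was not copied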
Having not worked with cartesian graphs since high school, I have actually found a real-life need for them. It may be a strange need, but I have to allocate data to points on a cartesian graph, accessible by their cartesian coordinates. There need to be infinitely many points on the graph. For example:
^
[-2-2,a ][-1-2,f ][0-2,k ][1-2,p ][2-2,u ]
[-2-1,b ][-1-1,g ][0-1,l ][1-1,q ][2-1,v ]
<[-2-0,c ][-1-0,h ][0-0,m ][1-0,r ][2-0,w ]>
[-2--1,d][-1--1,i][0--1,n][1--1,s][2--1,x]
[-2--2,e][-1--2,j][0--2,o][1--2,t][2--2,y]
v
The actual values aren't important. But say I am on variable m; this would be 0-0 on the cartesian graph. I need to calculate the cartesian coordinates I would have if I moved up one space, which would leave me on l.
Theoretically, say I have a Python variable equal to "0-1"; I believe I need to split it at the '-', which would leave x=0, y=1. Then I would need to perform int(y)+1, and re-attach x to y with a '-' in between.
What I want to be able to do is call a function with the argument (x+1,y+0), and for the program to perform the above, and then return the cartesian coordinate it has calculated.
I don't actually need to retrieve the value of the space, just the cartesian coordinate. I imagine I could utilise re.sub(), however I am not sure how to format this function correctly to split around the '-', and I'm also not sure how to perform the calculation correctly.
How would I do this?
To represent an infinite lattice, use a dictionary which maps tuples (x,y) to values.
grid = {}
grid[(0, 0)] = 'm'
grid[(0, 1)] = 'l'
print(grid[(0, 0)])  # -> m
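Moving around is then just tuple arithmetic; a minimal sketch (the function name move is illustrative):

def move(coord, dx, dy):
    # return the coordinate reached by shifting coord by (dx, dy)
    x, y = coord
    return (x + dx, y + dy)

print(move((0, 0), 0, 1))  # -> (0, 1), the cell holding 'l'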
I'm not sure I fully understand the problem but I would suggest using a list of lists to get the 2D structure.
Then to look up a particular value you could do coords[x-minX][y-minY] where x,y are the integer indices you want, and minX and minY are the minimum values (-2 in your example).
You might also want to look at NumPy which provides an n-dim object array type that is much more flexible, allowing you to 'slice' each axis or get subranges. The NumPy documentation might be helpful if you are new to working with arrays like this.
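For instance, a minimal sketch of such a grid as a NumPy object array, using the offset idea from above (minX and minY as in the example):

import numpy as np

grid = np.empty((5, 5), dtype=object)  # x and y each run from -2 to 2
minX, minY = -2, -2
grid[0 - minX][0 - minY] = 'm'
grid[0 - minX][1 - minY] = 'l'
print(grid[0 - minX][1 - minY])  # -> l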
EDIT:
To split a string like 0-1 into the constituent integers you can use:
s = '0-1'
[int(x) for x in s.split('-')]
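Be aware that a plain split('-') breaks on negative coordinates such as '-2--1'; a sketch using a regular expression that handles signs (the helper name parse_coord is illustrative):

import re

def parse_coord(s):
    # two signed integers separated by '-', e.g. '-2--1' -> (-2, -1)
    x, y = re.match(r'(-?\d+)-(-?\d+)$', s).groups()
    return int(x), int(y)

print(parse_coord('0-1'))    # -> (0, 1)
print(parse_coord('-2--1'))  # -> (-2, -1)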
You want to create a bidirectional mapping between the variable names and the coordinates: look up a variable's coordinates by name, apply your move function to them, then find the next variable from the new coordinates.
Converting between numeric tuples (which you can do arithmetic on) and strings (usable as dict keys), and back, is easy.
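A minimal sketch of that idea with two plain dicts (all names and values illustrative):

name_to_coord = {'m': (0, 0), 'l': (0, 1), 'q': (1, 1)}
coord_to_name = {c: n for n, c in name_to_coord.items()}

def move_from(name, dx, dy):
    # look up the coordinates by name, shift them, map back to a name
    x, y = name_to_coord[name]
    return coord_to_name[(x + dx, y + dy)]

print(move_from('m', 0, 1))  # -> l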