Best way to pass repeated parameter to a Numpy vectorized function - python

So, continuing from the discussion @TheBlackCat and I were having in this answer, I would like to know the best way to pass arguments to a Numpy vectorized function. The function in question is defined thus:
vect_dist_funct = np.vectorize(lambda p1, p2: vincenty(p1, p2).meters)
where vincenty comes from the Geopy package.
I currently call vect_dist_funct in this manner:
def pointer(point, centroid, tree_idx):
    intersect = list(tree_idx.intersection(point))
    if len(intersect) > 0:
        points = pd.Series([point]*len(intersect)).values
        polygons = centroid.loc[intersect].values
        dist = vect_dist_funct(points, polygons)
        return pd.Series(dist, index=intersect, name='Dist').sort_values()
    else:
        return pd.Series(np.nan, index=[0], name='Dist')

points['geometry'].apply(lambda x: pointer(point=x.coords[0], centroid=line['centroid'], tree_idx=tree_idx))
(Please refer to the question here: Labelled datatypes Python)
My question pertains to what happens inside the function pointer. The reason I am converting points to a pandas.Series and then getting the values (in the 4th line, just under the if statement) is to make it in the same shape as polygons. If I merely call points either as points = [point]*len(intersect) or as points = itertools.repeat(point, len(intersect)), Numpy complains that it "cannot broadcast arrays of size (n,2) and size (n,) together" (n is the length of intersect).
If I call vect_dist_funct like so: dist = vect_dist_funct(itertools.repeat(points, len(intersect)), polygons), vincenty complains that I have passed it too many arguments. I am at a complete loss to understand the difference between the two.
Note that these are coordinates, and therefore will always be in pairs. Here are examples of what point and polygons look like:
point = (-104.950752, 39.854744)  # Passed directly to the function like this.
polygons = array([(-104.21750802451864, 37.84052458697633),
                  (-105.01017084789603, 39.82012158954065),
                  (-105.03965315742742, 40.669867471420886),
                  (-104.90353460825702, 39.837631505433706),
                  (-104.8650601872832, 39.870796282334744)], dtype=object)
# As returned by the statement centroid.loc[intersect].values
What is the best way to call vect_dist_funct in this circumstance, such that I get a vectorized call and neither Numpy nor vincenty complains that I am passing the wrong arguments? Techniques that minimize memory consumption and increase speed are also sought. The goal is to compute the distance from the point to each polygon centroid.

np.vectorize doesn't really help you here. As per the documentation:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
In fact, vectorize actively hurts you, since it converts the inputs into numpy arrays, doing an unnecessary and expensive type conversion and producing the errors you are seeing. You are much better off using a function with a for loop.
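To make the shape mismatch concrete, here is a minimal sketch of what happens to the two inputs once they are turned into object arrays, which is roughly what np.vectorize does with its arguments. The names `repeated` and `centroids` are made up for illustration, and three copies of the same point stand in for the real centroids:

```python
import numpy as np

point = (-104.950752, 39.854744)
n = 3

# A list of n tuples becomes a 2-D array: one row per point, one column per coordinate.
repeated = np.asarray([point] * n, dtype=object)

# A pandas Series of tuples yields a 1-D object array whose elements are tuples.
# Build the same thing by hand (element-wise assignment keeps each tuple intact).
centroids = np.empty(n, dtype=object)
for i in range(n):
    centroids[i] = point

print(repeated.shape)   # (3, 2)
print(centroids.shape)  # (3,)

# Shapes (3, 2) and (3,) are not broadcast-compatible, hence the complaint.
try:
    np.broadcast(repeated, centroids)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
```

This is also why going through pd.Series(...).values appears to help: it keeps each tuple as a single element, so both inputs end up one-dimensional.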
It is also better to use a function rather than a lambda for a top-level function, since it lets you have a docstring.
So this is how I would implement what you are doing:
def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`. Returns
    a list where each value is the result of the `vincenty` function
    call for the corresponding element of `p2`.
    """
    return [vincenty(p1, p2i).meters for p2i in p2]
If you really want to use vectorize, you can use the excluded argument to not vectorize the p1 argument, or better yet set up a lambda that wraps vincenty and only vectorizes the second argument:
def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`. Returns
    a numpy array where each value is the result of the `vincenty`
    function call for the corresponding element of `p2`.
    """
    vinc_p = lambda x: vincenty(p1, x).meters
    return np.vectorize(vinc_p)(p2)
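For the excluded route mentioned above, a minimal sketch might look like the following. To keep it runnable without Geopy, `toy_dist` is a hypothetical planar stand-in for `vincenty(p1, p2).meters`, and the sample points are invented:

```python
import math
import numpy as np

def toy_dist(p1, p2):
    """Hypothetical stand-in for vincenty(p1, p2).meters (plain planar distance)."""
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

# excluded=[0] tells np.vectorize to pass the first positional argument through
# untouched instead of converting it to an array and looping over it.
vect_dist = np.vectorize(toy_dist, excluded=[0], otypes=[float])

point = (0.0, 0.0)

# 1-D object array whose elements are tuples, like centroid.loc[intersect].values.
polygons = np.empty(3, dtype=object)
for i, xy in enumerate([(3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]):
    polygons[i] = xy

dist = vect_dist(point, polygons)
print(dist)  # one distance per element of polygons
```

The single point is never repeated in memory here, but the call is still a Python-level loop under the hood, so no speed-up over the list comprehension should be expected.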

Related

Solving a 2-Variable function to only 1 of the variable doesnt work on arrays

I found on StackOverflow how to solve a function of two variables for one of them while treating the other as a known constant.
This is the part of the code:
def R(gg,a):
    return a-r0*g0**(1/2)*D(gg)/gg**(1/2)
def G(r):
    partial_func = functools.partial(R, a=r)
    return fsolve(partial_func,10,xtol=10**-1)
and it works, since for the first 2 prints, I get the same value
f=([10,15])
print(G(10))
print(G(f[0]))
print(G(f))
but when giving the full array it has the following error:
The array returned by a function changed size between calls
It looks like you are trying to find the roots of R for different values included in f.
The problem is that partial_func is given a single value as its starting estimate but returns an array of the same length as a (in your case 2 values).
In other words, there is not a single value root to your problem. For example the root for f[0]=10 is probably different from the root for f[1]=15. The solution should be an array of two values in this case.
To fix this, you need to give an array for the x0 (starting estimate) parameter of fsolve.
def G(r):
    partial_func = functools.partial(R, a=r)
    return fsolve(partial_func,[10,10],xtol=10**-1)
That way there is an initial estimate of gg for each value in a, and the solution is a vector of the same length as f.
Reading this without knowing all the parameters used in the function, I would say that in the case of print(G(f)) you provide a scalar starting estimate but the function returns an array, which does not work.
Try calling your function with a=f and look at the returned value.
The docs state:
fsolve: func: A function that takes at least one (possibly vector) argument, and returns a value of the same length
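As a sketch of the "same length" requirement, the version below sizes the starting estimate from the input instead of hard-coding two values. Since r0, g0 and D are not shown in the question, `R` here is a hypothetical replacement whose root in gg is simply sqrt(a), and the starting value 1.0 is an arbitrary choice:

```python
import functools
import numpy as np
from scipy.optimize import fsolve

def R(gg, a):
    # Hypothetical residual standing in for the original R: zero when gg**2 == a.
    return a - gg**2

def G(r):
    r = np.atleast_1d(np.asarray(r, dtype=float))
    partial_func = functools.partial(R, a=r)
    # One starting estimate per entry of r, so input and output lengths match.
    x0 = np.full(r.shape, 1.0)
    return fsolve(partial_func, x0)

roots = G([4.0, 9.0])
print(roots)  # one root per entry of the input
```

Note that this solves all entries as one coupled system; for a large f, solving each value in a loop may behave better.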

Creating a function in Python which runs over a range and returns a new value to an array each time

Basically, what I'm trying to create is a function which takes an array, in this case:
numpy.linspace(0, 0.2, 100)
and runs a lot of other code for each of the elements in the array and at the end creates a new array with one number for each of the calculations for each element. A simple example would be that the function is doing a multiplication like this:
def func(x):
    y = x * 10
    return (y)
However, I want it to be able to take an array as an argument and return an array consisting of each y for each multiplication. The function above works for this, but the one I've tried creating for my code doesn't work with this method and only returns one value instead. Is there another way to make the function work as intended? Thanks for the help!
You could use this simple code:
def func(x):
    y = []
    for i in x:
        y.append(i*10)
    return y
Maybe take a look at np.vectorize:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.vectorize.html
np.vectorize can for example be used as a decorator:
@np.vectorize
def func(value):
    ...
    return return_value
The function to be vectorized (here func) has to be a function,
that takes a value as input and returns a value.
This function then gets vectorized over the whole array.
It is mentioned in the documentation, but it can't hurt to emphasize it here:
In general this function is only used for convenience, not for performance;
it is basically equivalent to using a for-loop.
If you are able to build up your function from NumPy's ufuncs and array functions (np.add, np.mean, etc.), this will likely be much faster.
Or you could write your own:
https://docs.scipy.org/doc/numpy-1.13.0/reference/ufuncs.html
You can do this with numpy already with your function. For example, the code below will do what you want:
x = numpy.linspace(0, 0.2, 100)
y = x*10
If you defined x as above and passed it to your function it would perform exactly as you want.
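A small sketch comparing the two routes discussed above. The branch inside `func_scalar` is invented for illustration; it is the kind of scalar-only logic that breaks when handed a whole array, which is one plausible reason a function would fail to work element-wise:

```python
import numpy as np

@np.vectorize
def func_scalar(value):
    # Written for one value at a time; the `if` would fail on a whole array.
    return value * 10 if value < 0.1 else value

x = np.linspace(0, 0.2, 100)

# Route 1: np.vectorize loops over the elements for you (convenient, not fast).
y_loop = func_scalar(x)

# Route 2: the same logic expressed with array operations, no Python-level loop.
y_array = np.where(x < 0.1, x * 10, x)

print(y_loop.shape, y_array.shape)
```

Both give one output per input element; the second form is usually the faster one on large arrays.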

iterate over two numpy arrays return 1d array

I often have a function that returns a single value such as a maximum or integral. I then would like to iterate over another parameter. Here is a trivial example using a parabola. I don't think it's broadcasting, since I only want the 1D array. In this case it's the maximums. A real-world example is the maximum power point of a solar cell as a function of light intensity, but the principle is the same as in this example.
import numpy as np
x = np.linspace(-1,1) # sometimes this is read from file
parameters = np.array([1,12,3,5,6])
maximums = np.zeros_like(parameters)
for idx, parameter in enumerate(parameters):
    y = -x**2 + parameter
    maximums[idx] = np.max(y) # after I have the maximum I don't need the rest of the data.
print(maximums)
What is the best way to do this in Python/Numpy? I know one simplification is to make the function a def and then use np.vectorize but my understanding is it doesn't make the code any faster.
Extend one of those arrays to 2D and then let broadcasting do those outer additions in a vectorized way -
maximums = (-x**2 + parameters[:,None]).max(1).astype(parameters.dtype)
Alternatively, with the explicit use of the outer addition method -
np.add.outer(parameters, -x**2).max(1).astype(parameters.dtype)
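To see what the broadcast version builds, the sketch below spells out the intermediate shapes and checks the result against the explicit loop. The cast back to the integer dtype is left out here so the float maxima stay visible:

```python
import numpy as np

x = np.linspace(-1, 1)                 # 50 samples by default
parameters = np.array([1, 12, 3, 5, 6])

# parameters[:, None] has shape (5, 1); adding the (50,) array broadcasts to (5, 50):
# one row of y values per parameter.
y_all = -x**2 + parameters[:, None]
maximums = y_all.max(axis=1)           # reduce each row to a single number

# Same computation with the explicit loop, kept as floats for comparison.
loop_max = np.array([np.max(-x**2 + p) for p in parameters])

print(y_all.shape)   # (5, 50)
print(maximums)
```

With 50 evenly spaced samples, x never lands exactly on 0, so each float maximum sits slightly below its parameter; casting to the integer dtype of parameters, as both the loop in the question and the one-liners above do, then truncates it to the next lower integer.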

Defining a function of two variables, where one of the variables is integrated (scipy integrate.quad)

I am having trouble creating a function with two variables and three parameters. I want to perform a definite (numerical) integral over one of the variables (say t), and have it spit out an array F1(x;a,b,c), i.e. an array with a value associated with each entry in x, with scalar parameters a, b, and c. Ultimately I will need to fit the parameters (a,b,c) to data using leastsq, which I have done before using simpler functions.
Code looks like this:
def H1(t,x,a,b,c): #integrand
    return (a function of the above, with parameters a,b,c, dummy variable to be integrated from 0 to inf t, and x)
def F1(x,a,b,c): #integrates H1: 0<t<inf
    f_int1 = integrate.quad(H1,0.,np.inf,args=(x,a,b,c)) #integrating t from 0 to inf, x is going to be an element of the array in x_data.
    return f_int1
Now, for example if I try to use F1 as a function:
F1(x_data,70.,.05,.1) #where x_data is an array of real numbers, between 0 and 500
I get the message:
quadpack.error: Supplied function does not return a valid float
I am hoping it will spit out an array: F1 for all the entries in x_data. If I just use a single scalar value for the first input into F1, e.g.:
F1(x_data[4],70.,.05,.1)
It spits out two numbers, which are the value of F1 at that point and the error tolerance. This looks like part of what I want, but I think I need it to work when passing an array through. So: it works for passing a single scalar value, but I need it to accept an array (and therefore make an array).
I am guessing the problem lies when I am trying to pass an array through the function as an argument. Though I am not sure what is a better way to do this? I think I have to figure out a way to do it as a function, since I will be using leastsq in the next few lines of code. (I know how to use leastsq, I think!)
Anyone have any ideas on how to get around this?
scipy.integrate.quad does not accept array-valued functions. Your best bet is to have a loop over the components (possibly with syntactic sugar of numpy.vectorize).
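A minimal sketch of the loop approach. The real integrand is not given in the question, so `H1` below is a hypothetical exponential chosen only because its integral over t from 0 to infinity has the closed form a / (b*x + c), which makes the result easy to check:

```python
import numpy as np
from scipy import integrate

def H1(t, x, a, b, c):
    # Hypothetical integrand; the question does not show the real one.
    return a * np.exp(-(b * x + c) * t)

def F1(x, a, b, c):
    """Integrate H1 over t in [0, inf) separately for each entry of x."""
    x = np.atleast_1d(x)
    # quad returns (value, error_estimate); keep only the value for each x.
    return np.array([
        integrate.quad(H1, 0.0, np.inf, args=(xi, a, b, c))[0]
        for xi in x
    ])

x_data = np.array([1.0, 2.0, 5.0])
result = F1(x_data, 70.0, 0.05, 0.1)
print(result)  # one integral per entry of x_data
```

Because F1 now returns a plain array of the same length as x, it can be used in a residual function for leastsq.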

Which is faster, numpy transpose or flip indices?

I have a dynamic programming algorithm (modified Needleman-Wunsch) which requires the same basic calculation twice, but the calculation is done in the orthogonal direction the second time. For instance, from a given cell (i,j) in matrix scoreMatrix, I want to both calculate a value from values "up" from (i,j), as well as a value from values to the "left" of (i,j). In order to reuse the code I have used a function in which in the first case I send in parameters i,j,scoreMatrix, and in the next case I send in j,i,scoreMatrix.transpose(). Here is a highly simplified version of that code:
def calculateGapCost(i,j,scoreMatrix,gapcost):
    return scoreMatrix[i-1,j] - gapcost
...
gapLeft = calculateGapCost(i,j,scoreMatrix,gapcost)
gapUp = calculateGapCost(j,i,scoreMatrix.transpose(),gapcost)
...
I realized that I could alternatively send in a function that would in the one case pass through arguments (i,j) when retrieving a value from scoreMatrix, and in the other case reverse them to (j,i), rather than transposing the matrix each time.
def passThrough(i,j,matrix):
    return matrix[i,j]
def flipIndices(i,j,matrix):
    return matrix[j,i]
def calculateGapCost(i,j,scoreMatrix,gapcost,retrieveValue):
    return retrieveValue(i-1,j,scoreMatrix) - gapcost
...
gapLeft = calculateGapCost(i,j,scoreMatrix,gapcost,passThrough)
gapUp = calculateGapCost(j,i,scoreMatrix,gapcost,flipIndices)
...
However if numpy transpose uses some features I'm unaware of to do the transpose in just a few operations, it may be that transpose is in fact faster than my pass-through function idea. Can anyone tell me which would be faster (or if there is a better method I haven't thought of)?
The actual method would call retrieveValue 3 times, and involves 2 matrices that would be referenced (and thus transposed if using that approach).
In NumPy, transpose returns a view with a different shape and strides. It does not touch the data.
Therefore, you will likely find that the two approaches have identical performance, since in essence they are exactly the same.
However, the only way to be sure is to benchmark both.
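The view behaviour can be checked directly. In this sketch the small matrix and the index pair are arbitrary, and the timing lines are only a starting point for a benchmark on the real code:

```python
import timeit
import numpy as np

score = np.arange(12.0).reshape(3, 4)
score_t = score.transpose()

# The transpose shares the original buffer; only shape and strides are swapped.
same_buffer = np.shares_memory(score, score_t)
swapped_strides = score_t.strides == score.strides[::-1]

# Indexing the view with (j, i) reads the same element as (i, j) on the original.
i, j = 1, 2
same_value = score_t[j, i] == score[i, j]

# Rough timing of the two access patterns; the numbers depend on the machine.
t_transpose = timeit.timeit(lambda: score.transpose()[j, i], number=10_000)
t_flip = timeit.timeit(lambda: score[i, j], number=10_000)
print(same_buffer, swapped_strides, same_value)
print(t_transpose, t_flip)
```

Creating the view inside a tight loop still costs a small Python-level call each time, so if the transposed matrix is needed repeatedly it may be worth creating it once outside the loop.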
