I am looking to use option #2 to get the result of option #1.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(50), columns=['A'])

def test(x):
    v = 30
    if v > x:
        return x

# option 1
df['A'].apply(lambda x: test(x))
# option 2
test(df['A'])
The error message that I get when I run your code says:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
The problem is that a is an array and v is a single value, so the comparison has no single truth value. If your intention is to check whether v is greater than all the numbers in a, use np.all(v>a). If you want to check whether v is greater than at least one of them, use np.any(v>a).
On Edit
You have now edited your question so much that it is now a new question. The entire point of the apply method is that if f is a Python function and v is a numpy array, then f(v) is probably not the array that you would get by applying f to the elements of v.
Python is not a language that directly supports vectorized calculations. The reason that computations in numpy or pandas sometimes seem as easy to vectorize as similar calculations in, e.g., R is the way Python's duck typing works. If a class defines the magic method __add__, then you can use + to add instances of that class to each other in any way that you want. This is exactly what the people who created numpy have done (along with other magic methods for things like *, /, < etc.).
So, if a function definition is something like def f(x): return x*x + 2*x + 3, where all the computational steps correspond to magic methods, then v.apply(f) and f(v) will work the same. Your test function uses the keyword if. There is no magic method that can convert that part of the core language into something else.
Related
Basically, what I'm trying to create is a function which takes an array, in this case:
numpy.linspace(0, 0.2, 100)
and runs a lot of other code for each of the elements in the array, and at the end creates a new array with one number for each element's calculation. A simple example would be a function that does a multiplication like this:
def func(x):
    y = x * 10
    return y
However, I want it to be able to take an array as an argument and return an array consisting of each y for each multiplication. The function above works for this, but the one I've tried creating for my code doesn't work with this method and only returns one value instead. Is there another way to make the function work as intended? Thanks for the help!
You could use this simple code:
def func(x):
    y = []
    for i in x:
        y.append(i*10)
    return y
Maybe take a look at np.vectorize:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.vectorize.html
np.vectorize can, for example, be used as a decorator:
@np.vectorize
def func(value):
    ...
    return return_value
The function to be vectorized (here func) has to be a function that takes a single value as input and returns a single value. It is then applied across the whole array.
It is mentioned in the documentation, but it can't hurt to emphasize it here: in general, np.vectorize is provided for convenience, not for performance; it is essentially equivalent to a for-loop.
If you are able to build your function from NumPy's ufuncs and other array-level functions (np.add, np.mean, etc.), it will likely be much faster.
Or you could write your own:
https://docs.scipy.org/doc/numpy-1.13.0/reference/ufuncs.html
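As a small illustration of the decorator form, here is a sketch with a scalar function that contains a branch. The threshold 0.1 and the body of func are made up for the example; they are not from the question.

```python
import numpy as np

@np.vectorize
def func(value):
    # Scalar logic with an if, which would fail on a whole array at once
    if value < 0.1:
        return value * 10
    return value

x = np.linspace(0, 0.2, 100)
y = func(x)  # func is called once per element; y has the same shape as x
```

For a branch this simple, np.where(x < 0.1, x * 10, x) gives the same values without the per-element Python call.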
NumPy already does this with your function as written. For example, the code below does what you want:
import numpy
x = numpy.linspace(0, 0.2, 100)
y = x*10
If you define x as above and pass it to your function, it will behave exactly as you want.
Python has a built in functionality for checking the validity of entire slices: slice.indices. Is there something similar that is built-in for individual indices?
Specifically, I have an index, say a = -2 that I wish to normalize with respect to a 4-element list. Is there a method that is equivalent to the following already built in?
def check_index(index, length):
    if index < 0:
        index += length
    if index < 0 or index >= length:
        raise IndexError(...)
    return index
My end goal is to construct a tuple with a single non-None element. I am currently using list.__setitem__ to do the check for me, but it seems a little awkward/overkill:
items = [None] * 4
items[a] = 'item'
items = tuple(items)
I would like to be able to do
a = check_index(a, 4)
items = tuple('item' if i == a else None for i in range(4))
Everything in this example is pretty negotiable. The only things that are fixed are that I am getting a in a way that can have all of the problems that an arbitrary index can have, and that the final result has to be a tuple.
I would be more than happy if the solution used numpy and only really applied to numpy arrays instead of Python sequences. Either one would be perfect for the application I have in mind.
If I understand correctly, you can use range(length)[index], in your example range(4)[-2]. This properly handles negative and out-of-bounds indices. In Python 3, range() doesn't create a full list, so this will have decent performance even for large arguments.
If you have a large number of indices to process at once, you might get better performance doing the calculation with NumPy vectorized arithmetic, but I don't think the technique with range will work in that case. You'd have to do the calculation manually, using the implementation in your question.
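Both variants can be sketched as follows. The function names normalize_index and normalize_indices are hypothetical helpers introduced for illustration, and the vectorized version is just the check from the question rewritten with array operations.

```python
import numpy as np

def normalize_index(index, length):
    # Indexing a range applies the usual negative-index and bounds rules
    # and raises IndexError when the index is out of range.
    return range(length)[index]

def normalize_indices(indices, length):
    # Vectorized form of the manual check for many indices at once
    idx = np.asarray(indices)
    idx = np.where(idx < 0, idx + length, idx)
    if ((idx < 0) | (idx >= length)).any():
        raise IndexError("index out of range")
    return idx

a = normalize_index(-2, 4)
items = tuple('item' if i == a else None for i in range(4))
```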
There is a function called numpy.core.multiarray.normalize_axis_index which does exactly what I need. It is particularly useful to me because the implementation I had in mind was for numpy array indexing:
>>> from numpy.core.multiarray import normalize_axis_index
>>> normalize_axis_index(3, 4)
3
>>> normalize_axis_index(-3, 4)
1
>>> normalize_axis_index(-5, 4)
...
numpy.core._internal.AxisError: axis -5 is out of bounds for array of dimension 4
The function was added in version 1.13.0. The source for this function is available here, and the documentation source is here.
Say I have two numpy arrays of the same dimensions, e.g.:
a = np.ones((4,))
b = np.linspace(0,4,4)
and a function that is supposed to operate on elements of those arrays:
def my_func(x, y):
    # do something, e.g.
    z = x + y
    return z
How can I apply this function to the elements of a and b in an element-wise fashion and get the result back?
It depends, really. For the given function, how about a+b, for instance? Presumably you have something more complex in mind, though.
The most general solution is np.vectorize, but it's also the slowest. Depending on what you want to do, cleverer solutions may exist. Take a look at numexpr, for example.
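For the function as given, the two routes can be compared directly. This is a minimal sketch using the arrays and my_func from the question; the variable names direct and via_vectorize are only for illustration.

```python
import numpy as np

a = np.ones((4,))
b = np.linspace(0, 4, 4)

def my_func(x, y):
    z = x + y
    return z

# Arithmetic on arrays is already element-wise, so a plain call works
direct = my_func(a, b)

# np.vectorize calls my_func once per pair of elements (general but slow)
via_vectorize = np.vectorize(my_func)(a, b)
```

Both results hold the same values; np.vectorize only becomes necessary when the function body cannot operate on whole arrays.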
I have a dynamic programming algorithm (modified Needleman-Wunsch) which requires the same basic calculation twice, but the calculation is done in the orthogonal direction the second time. For instance, from a given cell (i,j) in matrix scoreMatrix, I want to both calculate a value from values "up" from (i,j), as well as a value from values to the "left" of (i,j). In order to reuse the code I have used a function in which in the first case I send in parameters i,j,scoreMatrix, and in the next case I send in j,i,scoreMatrix.transpose(). Here is a highly simplified version of that code:
def calculateGapCost(i, j, scoreMatrix, gapcost):
    return scoreMatrix[i-1, j] - gapcost
...
gapLeft = calculateGapCost(i, j, scoreMatrix, gapcost)
gapUp = calculateGapCost(j, i, scoreMatrix.transpose(), gapcost)
...
I realized that I could alternatively send in a function that would in the one case pass through arguments (i,j) when retrieving a value from scoreMatrix, and in the other case reverse them to (j,i), rather than transposing the matrix each time.
def passThrough(i, j, matrix):
    return matrix[i, j]
def flipIndices(i, j, matrix):
    return matrix[j, i]
def calculateGapCost(i, j, scoreMatrix, gapcost, retrieveValue):
    return retrieveValue(i-1, j, scoreMatrix) - gapcost
...
gapLeft = calculateGapCost(i, j, scoreMatrix, gapcost, passThrough)
gapUp = calculateGapCost(j, i, scoreMatrix, gapcost, flipIndices)
...
However if numpy transpose uses some features I'm unaware of to do the transpose in just a few operations, it may be that transpose is in fact faster than my pass-through function idea. Can anyone tell me which would be faster (or if there is a better method I haven't thought of)?
The actual method would call retrieveValue 3 times, and involves 2 matrices that would be referenced (and thus transposed if using that approach).
In NumPy, transpose returns a view with a different shape and strides. It does not touch the data.
Therefore, you will likely find that the two approaches have identical performance, since in essence they are exactly the same.
However, the only way to be sure is to benchmark both.
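The view behaviour is easy to check. The sketch below uses a small made-up array in place of the real score matrix, and the variable names are illustrative.

```python
import numpy as np

score_matrix = np.arange(12).reshape(3, 4)
transposed = score_matrix.transpose()

# No data is copied: both arrays refer to the same buffer
same_buffer = np.shares_memory(score_matrix, transposed)

# Only the shape and strides are swapped
shape_swapped = transposed.shape == score_matrix.shape[::-1]
strides_swapped = transposed.strides == score_matrix.strides[::-1]

# Writing through the transposed view changes the original array
transposed[0, 1] = 99
```

After the last line, score_matrix[1, 0] holds 99, which confirms that transposed[i, j] and score_matrix[j, i] address the same element.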
If I want to apply a function row-wise (or column-wise) to an ndarray, should I look to ufuncs (doesn't seem like it) or some type of array broadcasting (not what I'm looking for either)?
Edit
I am looking for something like R's apply function. For instance,
apply(X,1,function(x) x*2)
would multiply each row of X by 2 through an anonymously defined function, but it could also be a named function. (This is of course a silly, contrived example in which apply is not actually needed.) Is there no generic way to apply a function across a NumPy array's "axis"?
First off, many numpy functions take an axis argument. It's probably possible (and better) to do what you want with that sort of approach.
However, a generic "apply this function row-wise" approach would look something like this:
import numpy as np

def rowwise(func):
    def new_func(array2d, **kwargs):
        # Run the function once to determine the size of the output
        val = func(array2d[0], **kwargs)
        output_array = np.zeros((array2d.shape[0], val.size), dtype=val.dtype)
        output_array[0] = val
        for i, row in enumerate(array2d[1:], start=1):
            output_array[i] = func(row, **kwargs)
        return output_array
    return new_func

@rowwise
def test(data):
    return np.cumsum(data)

x = np.arange(20).reshape((4,5))
print(test(x))
Keep in mind that we can do exactly the same thing with just:
np.cumsum(x, axis=1)
There's often a better way than the generic approach, especially with numpy.
Edit:
I completely forgot about it, but the above is essentially equivalent to numpy.apply_along_axis.
So, we could re-write that as:
import numpy as np

def test(row):
    return np.cumsum(row)

x = np.arange(20).reshape((4,5))
print(np.apply_along_axis(test, 1, x))
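For comparison with the R call in the question, apply(X,1,function(x) x*2) could be sketched like this. The array X and the lambda are only for illustration.

```python
import numpy as np

X = np.arange(20).reshape((4, 5))

# Generic form: call the anonymous function on each row (axis=1)
doubled_generic = np.apply_along_axis(lambda row: row * 2, 1, X)

# Direct form: the same result with plain array arithmetic
doubled_direct = X * 2
```

As with the cumsum example, the direct form is preferable whenever it exists, since apply_along_axis loops over the rows in Python.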