Filter rows of a numpy array? - python

I am looking to apply a function to each row of a numpy array. If this function evaluates to true I will keep the row, otherwise I will discard it. For example, my function might be:
def f(row):
    if sum(row) > 10:
        return True
    else:
        return False
I was wondering if there was something similar to:
np.apply_over_axes()
which applies a function to each row of a numpy array and returns the result. I was hoping for something like:
np.filter_over_axes()
which would apply a function to each row of a numpy array and only return rows for which the function returned true. Is there anything like this? Or should I just use a for loop?

Ideally, you would be able to implement a vectorized version of your function and use that to do boolean indexing. For the vast majority of problems this is the right solution. Numpy provides quite a few functions that can act over various axes as well as all the basic operations and comparisons, so most useful conditions should be vectorizable.
import numpy as np
x = np.random.randn(20, 3)
x_new = x[np.sum(x, axis=1) > .5]
If you are absolutely sure that you can't do the above, I would suggest using a list comprehension (or np.apply_along_axis) to create an array of bools to index with.
def myfunc(row):
    return sum(row) > .5
bool_arr = np.array([myfunc(row) for row in x])
x_new = x[bool_arr]
This will get the job done in a relatively clean way, but will be significantly slower than a vectorized version. An example:
x = np.random.randn(5000, 200)
%timeit x[np.sum(x, axis=1) > .5]
# 100 loops, best of 3: 5.71 ms per loop
%timeit x[np.array([myfunc(row) for row in x])]
# 1 loops, best of 3: 217 ms per loop
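The answer above also mentions np.apply_along_axis as a way to build the boolean mask. A minimal sketch of that variant (same myfunc as above; it is no faster than the list comprehension, since it also loops in Python):
import numpy as np
x = np.random.randn(20, 3)
def myfunc(row):
    return sum(row) > .5
# apply_along_axis calls myfunc once per row (axis=1) and collects
# the scalar boolean results into a 1-D mask of length 20
bool_arr = np.apply_along_axis(myfunc, 1, x)
x_new = x[bool_arr]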

Related

Fast way to find indexes of nonzero entries for every row in a CSC matrix in Python

Here's the current implementation:
from scipy.sparse import csr_matrix
import numpy as np

def nonzero_indexes_by_row(input):
    return [
        np.nonzero(row)[1]
        for row in csr_matrix(input.T)
    ]
The matrix is very large (1.5M × 500K). Since I'm accessing rows, I have to convert from CSC to CSR first. The result is a 2-D list: each inner list contains the indexes of the nonzero entries in the corresponding row of the original matrix.
The current process takes 20 minutes. Is there a faster way?
Sure. You're pretty close to having an ideal solution, but you're allocating some unnecessary arrays. Here's a faster way:
from scipy import sparse
import numpy as np

def my_impl(csc):
    csr = csc.tocsr()
    return np.split(csr.indices, csr.indptr[1:-1])

def your_impl(input):
    return [
        np.nonzero(row)[1]
        for row in sparse.csr_matrix(input)
    ]
## Results
# demo data
csc = sparse.random(15000, 5000, format="csc")
your_result = your_impl(csc)
my_result = my_impl(csc)
## Tests for correctness
# Same result
assert all(np.array_equal(x, y) for x, y in zip(your_result, my_result))
# Right number of rows
assert len(my_result) == csc.shape[0]
## Speed
%timeit my_impl(csc)
# 31 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit your_impl(csc)
# 1.49 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Side question, why are you transposing the matrix? Wouldn't you then be getting the non-zero entries of the columns? If that's what you want, you don't even need to convert to csr and can just run:
np.split(csc.indices, csc.indptr[1:-1])
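To see why the np.split trick works: in CSR (and CSC) format, indices stores the column (respectively row) index of every stored value, and indptr[i]:indptr[i+1] delimits the slice of indices belonging to row (respectively column) i. A tiny worked example:
from scipy import sparse
import numpy as np
# 3x4 matrix with nonzeros at (0,1), (0,3) and (2,0)
m = sparse.csr_matrix(np.array([[0, 5, 0, 7],
                                [0, 0, 0, 0],
                                [3, 0, 0, 0]]))
print(m.indices)  # [1 3 0] -- column index of each stored value
print(m.indptr)   # [0 2 2 3] -- row i owns indices[indptr[i]:indptr[i+1]]
print(np.split(m.indices, m.indptr[1:-1]))
# row 0 -> [1, 3], row 1 -> [], row 2 -> [0]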
An interesting alternative to your code is to convert your array to the COOrdinate format and then read its row and col attributes:
def nonzero_indices_by_coo(input):
    cx = input.T.tocoo()
    res = [[] for i in range(cx.shape[0])]
    for i, j in zip(cx.row, cx.col):
        res[i].append(j)
    return res
It returns a list of plain Python lists instead of Numpy arrays, but this should not make any important difference. I noticed that your code internally transposes the source array (the T operator), so I did the same in my code.
To compare execution speed, I created the following sparse array (2000 by 300):
import numpy as np
import scipy.sparse

r = 2000; c = 300
x = scipy.sparse.lil_matrix((r, c))
for _ in range(r):
    x[np.random.randint(0, r - 1), np.random.randint(0, c - 1)] = np.random.randint(1, 100)
and my code ran about 12 times faster than yours.
A yet quicker solution (in another format)
Or maybe it would be better to generate a 2-D Numpy array with 2 rows: the first row holding the row indices of consecutive non-zero elements, and the second row their column indices.
To generate such result, you can use the following code:
def nonzero_indices_2d(input):
    cx = input.T.tocoo()
    return np.array([cx.row, cx.col])
which runs 4 times faster than my first solution. Of course, other parts of your code would then have to be reworked to consume the indices in this other format.
Sparse arrays also have their own nonzero method:
arr.nonzero()
which generates the indices as a pair of Numpy arrays (rows and columns). This runs a few percent faster still.
So, assuming that the 2-D result format is acceptable (instead of a list of lists), maybe you don't need any function of your own to get these indices.
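For example, nonzero returns the indices directly as a (row_indices, col_indices) pair, with no helper function needed (a minimal sketch):
from scipy import sparse
x = sparse.random(2000, 300, format="csc")
rows, cols = x.T.nonzero()  # two aligned 1-D index arrays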
Another detail to consider: whether transposition should be used at all (in any of the versions). That's your choice, but without transposition each version of the code would run a bit faster.

Apply np.where against square bracket filtering for numpy filtering

I could perform filtering of numpy arrays via
a[np.where(a[:,0]==some_expression)]
or
a[a[:,0]==some_expression]
What are the (dis)advantages of each of these versions - especially with regard to performance?
Boolean indexing is transformed into integer indexing internally. This is indicated in the docs:
In general if an index includes a Boolean array, the result will be
identical to inserting obj.nonzero() into the same position and
using the integer array indexing mechanism described above.
So the complexity of the two approaches is the same. But np.where is more efficient for large arrays:
np.random.seed(0)
a = np.random.randint(0, 10, (10**7, 1))
%timeit a[np.where(a[:, 0] == 5)] # 50.1 ms per loop
%timeit a[a[:, 0] == 5] # 62.6 ms per loop
Now np.where has other benefits: advanced integer indexing works well across multiple dimensions. For an example where Boolean indexing is unintuitive in this aspect, see NumPy indexing: broadcasting with Boolean arrays. Since np.where is more efficient than Boolean indexing, this is just an extra reason it should be preferred.
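To illustrate the equivalence the docs describe, all three forms below select the same rows (a small sketch):
import numpy as np
a = np.random.randint(0, 10, (100, 1))
mask = a[:, 0] == 5
r1 = a[mask]                # boolean indexing
r2 = a[np.where(mask)]      # integer indexing via np.where
r3 = a[mask.nonzero()]      # what the boolean index becomes internally
assert np.array_equal(r1, r2) and np.array_equal(r2, r3)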
To my surprise, the first one seems to perform slightly better:
import numpy as np
import timeit

# note: np.random.random_integers is deprecated; np.random.randint is the modern equivalent
a = np.random.random_integers(100, size=(1000, 1))

repeat = 3
numbers = 1000

def time(statement, _setup=None):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))

setup = """from __main__ import np, a"""
time('a[np.where(a[:,0]==99)]')
time('a[(a[:,0]==99)]')
prints (for instance):
0.017856399000000023
0.019185326999999974
Increasing the size of the array makes the numbers differ even more.

Python Pandas, apply function

I am trying to use apply to avoid an iterrows() iterator in a function.
However, that pandas method is poorly documented and I can't find examples of how to use it, apart from the lame .apply(np.sqrt) in the documentation... There are no examples of how to use arguments, etc.
Anyway, here is a toy example of what I am trying to do.
In my understanding, apply will actually do the same as iterrows(), i.e. iterate (over the rows if axis=0). On each iteration, the input x of the function should be the row being iterated over. However, the error messages I keep receiving sort of disprove that assumption...
grid = np.random.rand(5, 2)
df = pd.DataFrame(grid)

def multiply(x):
    x[3] = x[0] * x[1]

df = df.apply(multiply, axis=0)
The example above returns an empty df. Can anyone shed some light on my misunderstanding?
import pandas as pd
import numpy as np

grid = np.random.rand(5, 2)
df = pd.DataFrame(grid)

def multiply(x):
    return x[0] * x[1]

df['multiply'] = df.apply(multiply, axis=1)
print(df)
Results in:
0 1 multiply
0 0.550750 0.713054 0.392715
1 0.061949 0.661614 0.040987
2 0.472134 0.783479 0.369907
3 0.827371 0.277591 0.229670
4 0.961102 0.137510 0.132162
Explanation:
The function you are applying needs to return a value. You were also applying it to each column rather than each row; the axis parameter you passed was incorrect in this regard.
Finally, notice that I am assigning the result to the 'multiply' column outside of the function. You can easily change this to df[3] = ... like you have and get a dataframe like this:
0 1 3
0 0.550750 0.713054 0.392715
1 0.061949 0.661614 0.040987
2 0.472134 0.783479 0.369907
3 0.827371 0.277591 0.229670
4 0.961102 0.137510 0.132162
It should be noted that you can use lambda functions as well; see the pandas documentation for apply.
For your example, you can run:
df['multiply'] = df.apply(lambda row: row[0] * row[1], axis = 1)
which produces the same output as the answer above.
This can be useful if your function has the form:
def multiply(a, b):
    return a * b

df['multiply'] = df.apply(lambda row: multiply(row[0], row[1]), axis=1)
More examples can be found in the Enhancing Performance section of the pandas docs.
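Regarding the question's complaint that the documentation shows no examples of how to pass arguments: apply also takes an args tuple (and forwards extra keyword arguments) to the applied function, so a lambda is not strictly required. A minimal sketch, with a hypothetical factor parameter:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 2))
def scaled_product(row, factor):
    return row[0] * row[1] * factor
# positional arguments after the row go in args; keywords can be passed directly
df['scaled'] = df.apply(scaled_product, axis=1, args=(10,))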
When apply-ing a function, you need that function to return the result of the operation over the column/row. You are getting None values because multiply doesn't return anything; the applied function should return a result computed from the values, not do the assignment itself.
You're also iterating over the wrong axis here: your current code takes the first and second element of each column and multiplies them together.
A correct multiply function:
def multiply(x):
    return x[0] * x[1]

df[3] = df.apply(multiply, axis='columns')
With that being said, you can do much better than apply here, as it is not a vectorized operation. Just multiply the columns together directly.
df[3] = df[0]*df[1]
In general, you should avoid apply when possible as it is not much more than a loop itself under the hood.
One of the rules of Pandas Zen says: always try to find a vectorized solution first.
.apply(..., axis=1) is not vectorized!
Consider alternatives:
In [164]: df.prod(axis=1)
Out[164]:
0 0.770675
1 0.539782
2 0.318027
3 0.597172
4 0.211643
dtype: float64
In [165]: df[0] * df[1]
Out[165]:
0 0.770675
1 0.539782
2 0.318027
3 0.597172
4 0.211643
dtype: float64
Timing against a 50,000-row DataFrame:
In [166]: df = pd.concat([df] * 10**4, ignore_index=True)
In [167]: df.shape
Out[167]: (50000, 2)
In [168]: %timeit df.apply(multiply, axis=1)
1 loop, best of 3: 6.12 s per loop
In [169]: %timeit df.prod(axis=1)
100 loops, best of 3: 6.23 ms per loop
In [170]: def multiply_vect(x1, x2):
...: return x1*x2
...:
In [171]: %timeit multiply_vect(df[0], df[1])
1000 loops, best of 3: 604 µs per loop
Conclusion: use .apply() as a very last resort (i.e. when nothing else helps)

Efficiently determining if large sorted numpy array has only unique values

I have a very large numpy array and I want to sort it and test if it is unique.
I'm aware of the function numpy.unique but it sorts the array another time to achieve it.
The reason I need the array sorted a priori is because the returned keys from the argsort function will be used to reorder another array.
I'm looking for a way to do both (argsort and unique test) without the need to sort the array again.
Example code:
import numpy as np
import numpy.random
# generating random arrays with 2 ^ 27 columns (it can grow even bigger!)
slices = np.random.random_integers(2 ** 32, size = 2 ** 27)
values = np.random.random_integers(2 ** 32, size = 2 ** 27)
# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()
# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]
# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
    print('it is unique!')
else:
    print('not unique!')
Both the arrays slices and values have 1 row and the same (huge) number of columns.
Thanks in advance.
You can check whether there are two or more equal values next to each other (non-unique values in a sorted array) by comparing their difference to 0:
numpy.any(numpy.diff(slices) == 0)
Be aware though that numpy will create two intermediate arrays: one with the difference values, one with boolean values.
Here's an approach that makes use of slicing: instead of actual differencing, we can just compare each element against the previous one, without computing any difference values, like so -
~((slices[1:] == slices[:-1]).any())
Runtime test -
In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))
# @Nils Werner's soln
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop
# @Marco's suggestion in comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop
# Proposed soln in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop
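Putting this back into the question's setup, the whole pipeline sorts once and reuses both the order and the adjacent-equality test. A sketch (using np.random.randint with an explicit dtype instead of the deprecated random_integers, and a smaller size for illustration):
import numpy as np
slices = np.random.randint(0, 2 ** 32, size=2 ** 20, dtype=np.int64)
values = np.random.randint(0, 2 ** 32, size=2 ** 20, dtype=np.int64)
order = slices.argsort()  # computed only once
slices = slices[order]
values = values[order]
# a sorted array is unique iff no element equals its neighbour
is_unique = not (slices[1:] == slices[:-1]).any()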

Apply numpy index to matrix

I have spent the last hour trying to figure this out.
Suppose we have
import numpy as np
a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)
print(amin_index)
> [ 0 12 5 18 1] # or something similar
This does not work:
a[amin_index]
So, in essence, I need to find the minima along a certain axis for the array np.abs(a), but then extract the values from the array a at these positions. How can I apply an index to just one axis?
Probably very simple, but I can't get it figured out. Also, I can't use any loops since I have to do this for arrays with several million entries.
thanks 😊
One way is to pass in the array of row indexes (e.g. [0,1,2,3,4]) and the list of column indexes for the minimum in each corresponding row (your list amin_index).
This returns an array containing the value at [i, amin_index[i]] for each row i:
>>> a[np.arange(a.shape[0]), amin_index]
array([-0.0069325 , 0.04268358, -0.00128002, -0.01185333, -0.00389487])
Note that this is advanced (integer array) indexing rather than basic indexing, so the returned array is a new array in memory (a copy), not a view of a.
This is because argmin returns, for each row, the index of the column holding the minimum (with axis=1), so you need to access each row at that particular column:
a[range(a.shape[0]), amin_index]
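On newer NumPy versions (1.15+), np.take_along_axis expresses the same lookup without building the row-index array by hand; a minimal sketch:
import numpy as np
a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)
# indices must have the same number of dimensions as a, hence [:, None]
result = np.take_along_axis(a, amin_index[:, None], axis=1)[:, 0]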
Why not simply do np.amin(np.abs(a), axis=1)? It's much simpler if you don't need the intermediate amin_index array from argmin. Numpy's reference page is an excellent resource; see "Indexing".
Edit: timing is always useful:
In [3]: a=np.random.rand(4000, 4000)-.5
In [4]: %timeit np.amin(np.abs(a), axis=1)
10 loops, best of 3: 128 ms per loop
In [5]: %timeit a[np.arange(a.shape[0]), np.argmin(np.abs(a), axis=1)]
10 loops, best of 3: 135 ms per loop
