I have a numpy array as follows.
import numpy as np
data = np.array([[0,0,0,4],
[3,0,5,0],
[8,9,5,3]])
print (data)
I have to extract only those rows whose first three elements are not all zeros.
The expected result is as follows:
result = np.array([[3,0,5,0],
[8,9,5,3]])
I tried as:
res = [l for l in data if l[:3].sum() !=0]
print (res)
It gives the result, but I'm looking for a better, more numpy-like way of doing it.
sum is a bit unreliable if your array can contain negative numbers, but any will always work:
result = data[data[:, :3].any(1)]
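To see the difference concretely, here is a made-up row whose first three entries cancel out: a sum-based test drops it, while any keeps it.
import numpy as np
# Hypothetical counterexample: the first three entries are not all zero,
# yet they sum to zero, so the sum-based test wrongly discards the row.
tricky = np.array([[2, -2, 0, 7]])
print(tricky[tricky[:, :3].sum(axis=1) != 0])  # empty -- row is lost
print(tricky[tricky[:, :3].any(1)])            # [[ 2 -2  0  7]] -- row is kept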
You say
first three elements are not all zeros
so a solution is
import numpy as np
data = np.array([[0,0,0,4],
[3,0,5,0],
[8,9,5,3]])
data[~np.all(data[:, :3] == 0, axis=1), :]
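For the sample data, the intermediate boolean mask looks like this (shown purely to illustrate the indexing step):
mask = ~np.all(data[:, :3] == 0, axis=1)
print(mask)           # [False  True  True]
print(data[mask, :])  # [[3 0 5 0]
                      #  [8 9 5 3]]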
I'll try to explain how I think about these kinds of problems through my answer.
First step: define a function that returns a boolean indicating whether this is a good row.
For that, I use np.any, which checks whether any of the entries is "True" (for integers, any non-zero value counts as True).
import numpy as np
v1 = np.array([1, 1, 1, 0])
v2 = np.array([0, 0, 0, 1])
good_row = lambda v: np.any(v[:3])
good_row(v1)
Out[28]: True
good_row(v2)
Out[29]: False
Second step: I apply this on all rows, and obtain a masking vector. To do so, one can use the 'axis' keyword in 'np.any', which will apply this on columns or rows depending on the axis value.
np.any(data[:, :3], axis=1)
Out[32]: array([False, True, True])
Final step: I combine this with indexing to wrap it all up.
rows_inds = np.any(data[:, :3], axis=1)
data[rows_inds]
Out[37]:
array([[3, 0, 5, 0],
[8, 9, 5, 3]])
Given a 2D numpy array, I want to construct an array out of the column indices of the maximum value of each row. So far, arr.argmax(1) works well. However, for my specific case, for some rows, 2 or more columns may contain the maximum value. In that case, I want to select a column index randomly (not the first index as it is the case with .argmax(1)).
For example, for the following arr:
arr = np.array([
[0, 1, 0],
[1, 1, 0],
[2, 1, 3],
[3, 2, 2]
])
there can be two possible outcomes: array([1, 0, 2, 0]) and array([1, 1, 2, 0]) each chosen with 1/2 probability.
I have code that returns the expected output using a list comprehension:
idx = np.arange(arr.shape[1])
ans = [np.random.choice(idx[ix]) for ix in arr == arr.max(1, keepdims=True)]
but I'm looking for an optimized numpy solution. In other words, how do I replace the list comprehension with numpy methods to make the code feasible for bigger arrays?
Use scipy.stats.rankdata and apply_along_axis as follows.
import numpy as np
from scipy.stats import rankdata
ranks = rankdata(-arr, axis=1, method="min")
func = lambda x: np.random.choice(np.where(x==1)[0])
idx = np.apply_along_axis(func, 1, ranks)
print(idx)
It returns [1 0 2 0] or [1 1 2 0].
The main idea is that rankdata calculates the rank of every value within each row, so a row's maximum value gets rank 1. func randomly chooses one of the indices whose rank is 1. Finally, apply_along_axis applies func to every row of ranks.
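To make the intermediate step concrete, this is what ranks holds for the example arr, continuing the snippet above (rank 1 marks a row maximum):
print(ranks)
# [[2. 1. 2.]
#  [1. 1. 3.]
#  [2. 3. 1.]
#  [1. 2. 2.]]
# Row 1 has two entries ranked 1 (a tie), so func picks index 0 or 1 at random.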
After some advice I got offline, it turns out that randomizing among maximum values is possible if we multiply the boolean array that flags row-wise maximum values by a random array of the same shape. Then what remains is a simple argmax(1) call.
# boolean array that flags maximum values of each row
mxs = arr == arr.max(1, keepdims=True)
# random array where non-maximum values are zero and maximum values are random values
random_arr = np.random.rand(*arr.shape) * mxs
# row-wise maximum of the auxiliary array
ans = random_arr.argmax(1)
A timeit test shows that for data of shape (507_563, 12), this code runs in ~172 ms on my machine while the loop in the question runs for 11 sec, so this is about 63x faster.
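For reference, a comparison along these lines could be reproduced with the sketch below; the random test data is only an assumed stand-in for the original benchmark input.
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
arr = rng.integers(0, 10, size=(507_563, 12))  # assumed test data, not the original

def random_argmax(arr):
    # flag row-wise maxima, scale by random values, then take argmax per row
    mxs = arr == arr.max(1, keepdims=True)
    return (np.random.rand(*arr.shape) * mxs).argmax(1)

print(timeit(lambda: random_argmax(arr), number=10) / 10, "seconds per run")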
I have two arrays. I would like to do an element-wise comparison between the two of them to find out which values are the same.
a= np.array([[1,2],[3,4]])
b= np.array([[3,2],[1,4]])
Is there a way for me to compare these two arrays to 1) find out which values are the same and 2) get the index of the same values?
Adding on to the previous question, is there a way for me to return 1 if the values are the same and 0 otherwise?
Thanks in advance!
a= np.array([[1,2],[3,4]])
b= np.array([[3,2],[1,4]])
#1) find out which values are the same
a==b
# array([[False, True],
# [False, True]])
#2) get the index of the same values?
np.where((a==b) == True) # or np.where(a==b)
#(array([0, 1]), array([1, 1]))
# Adding on to the previous question, is there a way for me to return 1 if the values are the same and 0 otherwise
(a==b).astype(int)
# array([[0, 1],
# [0, 1]])
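As a side note, if you would rather have the matching positions as (row, column) pairs instead of two parallel index arrays, np.argwhere gives that directly:
np.argwhere(a == b)
# array([[0, 1],
#        [1, 1]])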
I have 3 numpy.ndarray vectors: X, Y, and intensity. I would like to combine them into one numpy array and then sort it by the third column (or the first one). I tried the following code:
m=np.column_stack((X,Y))
m=np.column_stack((m,intensity))
m=np.sort(m,axis=2)
Then I got the error: ValueError: axis(=2) out of bounds.
When I print m, I get:
array([[ 109430, 285103, 121],
[ 134497, 284907, 134],
[ 160038, 285321, 132],
...,
[12374406, 2742429, 148],
[12371858, 2741994, 148],
[12372221, 2742017, 161]])
How can I fix it, that is, get a sorted array?
axis=2 does not refer to the column index but to a dimension of the array: numpy will look for a third dimension in the data and sort along it from smallest to largest. Sorting along the first dimension (axis=0) sorts each column independently, so values increase down the rows; sorting along the second dimension (axis=1) sorts each row independently, so values increase across the columns. Examples are below.
Furthermore, sort works differently depending on the kind of array. Two kinds are considered here: unstructured and structured.
Unstructured
import numpy as np
X = np.random.randn(3)
Y = np.random.randn(3)
intensity = np.random.randn(3)
m=np.column_stack((X,Y))
m=np.column_stack((m,intensity))
m is being treated as an unstructured array because there are no fields linked to any of the columns. In other words, if you call np.sort() on m, it will just sort them from smallest to largest from top to bottom if axis=0 and left to right if axis=1. The rows are not being preserved.
Original:
[[ 1.20122251 1.41451461 -1.66427245]
[ 1.3657312 -0.2318793 -0.23870104]
[-0.30280613 0.79123814 -1.64082042]]
Axis=1:
[[-1.66427245 1.20122251 1.41451461]
[-0.23870104 -0.2318793 1.3657312 ]
[-1.64082042 -0.30280613 0.79123814]]
Axis=0:
[[-0.30280613 -0.2318793 -1.66427245]
[ 1.20122251 0.79123814 -1.64082042]
[ 1.3657312 1.41451461 -0.23870104]]
Structured
As you can see, the structure of the rows is not kept. If you would like to preserve the rows, you need to attach field names to the data types and build a structured array. You can then sort by any column with order=field_name.
dtype = [("a",float),("b",float),("c",float)]
m = [tuple(x) for x in m]
labelled_arr = np.array(m,dtype)
print np.sort(labelled_arr,order="a")
This will get:
[(-0.30280612629541204, 0.7912381363389004, -1.640820419927318)
(1.2012225144719493, 1.4145146097431947, -1.6642724545574712)
(1.3657312047892836, -0.23187929505306418, -0.2387010374198555)]
Another, more convenient way of doing this is to pass the data into a pandas DataFrame, which automatically creates column names from 0 to n-1. You can then call the sort_values method with the column index you want, adding axis=0 if you would like it sorted from top to bottom, just as in numpy.
Example:
pd.DataFrame(m).sort_values(0,axis = 0)
Output:
0 1 2
2 -0.302806 0.791238 -1.640820
0 1.201223 1.414515 -1.664272
1 1.365731 -0.231879 -0.238701
You are getting that error because you don't have an axis with index 2; axes are zero-indexed. Regardless, np.sort will sort every column or every row independently. Consider from the docs:
order : str or list of str, optional When a is an array with fields
defined, this argument specifies which fields to compare first,
second, etc. A single field can be specified as a string, and not all
fields need be specified, but unspecified fields will still be used,
in the order in which they come up in the dtype, to break ties.
For example:
In [28]: a
Out[28]:
array([[0, 0, 1],
[1, 2, 3],
[3, 1, 8]])
In [29]: np.sort(a, axis = 0)
Out[29]:
array([[0, 0, 1],
[1, 1, 3],
[3, 2, 8]])
In [30]: np.sort(a, axis = 1)
Out[30]:
array([[0, 0, 1],
[1, 2, 3],
[1, 3, 8]])
So, I think what you really want is this neat little idiom:
In [32]: a[a[:,2].argsort()]
Out[32]:
array([[0, 0, 1],
[1, 2, 3],
[3, 1, 8]])
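Applied to the m from the question, sorting by the third column (or the first) would then read:
m[m[:, 2].argsort()]  # rows ordered by the third column (intensity)
m[m[:, 0].argsort()]  # or by the first column (X)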
I am trying to remove all rows that contain only zeros from a NumPy array. For example, I want to remove [0,0] from
n = np.array([[1,2], [0,0], [5,6]])
and be left with:
np.array([[1,2], [5,6]])
To remove the second row from a numpy array:
import numpy
n = numpy.array([[1,2],[0,0],[5,6]])
new_n = numpy.delete(n, 1, axis=0)
To remove rows containing only 0:
import numpy
n = numpy.array([[1,2],[0,0],[5,6]])
idxs = numpy.any(n != 0, axis=1) # index of rows with at least one non zero value
n_non_zero = n[idxs, :] # selection of the wanted rows
If you want to delete any row that only contains zeros, the fastest way I can think of is:
n = numpy.array([[1,2], [0,0], [5,6]])
keep_row = n.any(axis=1) # Index of rows with at least one non-zero value
n_non_zero = n[keep_row] # Rows to keep, only
This runs much faster than Simon's answer, because n.any() stops checking the values of each row as soon as it encounters any non-zero value (in Simon's answer, all the elements of each row are compared to zero first, which results in unnecessary computations).
Here is a generalization of the answer, if you ever need to remove rows that have a specific value (instead of removing only rows that contain only zeros):
n = numpy.array([[1,2], [0,0], [5,6]])
to_be_removed = [0, 0] # Can be any row values: [5, 6], etc.
other_rows = (n != to_be_removed).any(axis=1) # Rows that have at least one element that differs
n_other_rows = n[other_rows] # New array with rows equal to to_be_removed removed.
Note that this solution is not fully optimized: even if the first element of to_be_removed does not match, the remaining row elements from n are compared to those of to_be_removed (as in Simon's answer).
I'd be curious to know if there is a simple efficient NumPy solution to the more general problem of deleting rows with a specific value.
Using cython loops might be a fast solution: for each row, element comparison could be stopped as soon as one element from the row differs from the corresponding element in to_be_removed.
You can use numpy.delete to remove specific rows or columns.
For example:
import numpy as np
n = [[1,2], [0,0], [5,6]]
np.delete(n, 1, axis=0)
The output will be:
array([[1, 2],
[5, 6]])
To delete rows by value rather than by index, you can do it like this:
>>> n
array([[1, 2],
[0, 0],
[5, 6]])
>>> bl=n==[0,0]
>>> bl
array([[False, False],
[ True, True],
[False, False]], dtype=bool)
>>> bl=np.any(bl,axis=1)
>>> bl
array([False, True, False], dtype=bool)
>>> ind=np.nonzero(bl)[0]
>>> ind
array([1])
>>> np.delete(n,ind,axis=0)
array([[1, 2],
[5, 6]])
I frequently use the numpy.where function to gather a tuple of indices of a matrix having some property. For example
import numpy as np
X = np.random.rand(3,3)
>>> X
array([[ 0.51035326, 0.41536004, 0.37821622],
[ 0.32285063, 0.29847402, 0.82969935],
[ 0.74340225, 0.51553363, 0.22528989]])
>>> ix = np.where(X > 0.5)
>>> ix
(array([0, 1, 2, 2]), array([0, 2, 0, 1]))
ix is now a tuple of ndarray objects that contain the row and column indices, whereas the sub-expression X>0.5 contains a single boolean matrix indicating which cells had the >0.5 property. Each representation has its own advantages.
What is the best way to take ix object and convert it back to the boolean form later when it is desired? For example
G = np.zeros(X.shape, dtype=bool)
>>> G[ix] = True
Is there a one-liner that accomplishes the same thing?
Something like this maybe?
mask = np.zeros(X.shape, dtype='bool')
mask[ix] = True
but if it's something simple like X > 0, you're probably better off doing mask = X > 0 unless mask is very sparse or you no longer have a reference to X.
mask = X > 0
imask = np.logical_not(mask)
Edit: Sorry for being so concise before. Shouldn't be answering things on the phone :P
As I noted in the example, it's better to just invert the boolean mask. Much more efficient/easier than going back from the result of where.
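For boolean arrays, the same inversion can also be written with the ~ operator:
mask = X > 0
imask = ~mask  # equivalent to np.logical_not(mask) for boolean arrays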
The bottom of the np.where docstring suggests to use np.in1d for this.
>>> x = np.array([1, 3, 4, 1, 2, 7, 6])
>>> indices = np.where(x % 3 == 1)[0]
>>> indices
array([0, 2, 3, 5])
>>> np.in1d(np.arange(len(x)), indices)
array([ True, False, True, True, False, True, False], dtype=bool)
(While this is a nice one-liner, it is a lot slower than @Bi Rico's solution.)