Python delete row in numpy array

I have a large numpy array (8 by 30000) and I want to delete some rows according to a criterion. The criterion applies to a single column only.
Example:
>>> p = np.array([[0, 1, 3], [1, 5, 6], [4, 3, 56], [1, 34, 4]])
>>> p
array([[ 0,  1,  3],
       [ 1,  5,  6],
       [ 4,  3, 56],
       [ 1, 34,  4]])
Here I would like to remove every row in which the value of the 3rd column is > 30, i.e. here row 3.
As the array is pretty large, I'd like to avoid for loops. I thought of this:
>>> p[~(p > 30).any(1), :]
array([[0, 1, 3],
       [1, 5, 6]])
But that obviously removes the two last rows, since the condition looks at every column. Any ideas on how to do this in an efficient way?

p = p[~(p[:, 2] > 30)]
or (if your condition is easily invertible):
p = p[p[:, 2] <= 30]
returns
array([[ 0,  1,  3],
       [ 1,  5,  6],
       [ 1, 34,  4]])

Related

How to index a numpy array with another numpy array in python

I am trying to index an np.array with another array so that I can have zeros everywhere after a certain index, but it gives me the error:
TypeError: only integer scalar arrays can be converted to a scalar index
Basically what I would like my code to do is that if I have:
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
d = np.array([2, 1, 3])
that I could do something like
a[d:] = 0
to give the output
a = [[ 1  2  3]
     [ 4  0  6]
     [ 0  0  9]
     [ 0  0  0]]
It can be done with array indexing but it doesn't feel natural.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
d = np.array([2, 1, 3])
col_ix = [0, 0, 1, 1, 1, 2]  # column index for each item to change
row_ix = [2, 3, 1, 2, 3, 3]  # row index for each item to change
a[row_ix, col_ix] = 0
a
# array([[1, 2, 3],
#        [4, 0, 6],
#        [0, 0, 9],
#        [0, 0, 0]])
With a for loop:
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
for ix_col, ix_row in enumerate(d):  # iterate across the columns
    a[ix_row:, ix_col] = 0
a
# array([[1, 2, 3],
#        [4, 0, 6],
#        [0, 0, 9],
#        [0, 0, 0]])
A widely used approach for this kind of problem is to construct a boolean mask, comparing the index array with the appropriate arange:
In [619]: mask = np.arange(4)[:,None] >= d
In [620]: mask
Out[620]:
array([[False, False, False],
       [False,  True, False],
       [ True,  True, False],
       [ True,  True,  True]])
In [621]: a[mask]
Out[621]: array([ 5,  7,  8, 10, 11, 12])
In [622]: a[mask] = 0
In [623]: a
Out[623]:
array([[1, 2, 3],
       [4, 0, 6],
       [0, 0, 9],
       [0, 0, 0]])
That's not necessarily faster than iterating over rows (or, in this case, columns). Since slicing is basic indexing, the iteration may even be faster, despite being done several times.
In [624]: for i, v in enumerate(d):
     ...:     print(a[v:, i])
     ...:
[0 0]
[0 0 0]
[0]
Generally if a result involves multiple arrays or lists with different lengths, there isn't a "neat" multidimensional solution. Either iterate over those lists, or step back and "think outside the box".
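For reference, here is the mask approach from the transcript above collected into one runnable snippet (same a and d as in the question):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
d = np.array([2, 1, 3])

# entry (i, j) is zeroed exactly when i >= d[j]
mask = np.arange(a.shape[0])[:, None] >= d
a[mask] = 0
```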

how to merge several arrays stored in list

I want to concatenate several arrays stored in a list. The lengths of the arrays differ. I already read this solution, but unfortunately I could not solve my problem. This is the simplified input data:
arr_all = [array([[1, 2, 10],
                  [5, 8, 3]]),
           array([[1, 0, 5]]),
           array([[0, 1, 8]]),
           array([[9, 13, 0]]),
           array([[2, 10, 2],
                  [1.1, 3, 3]]),
           array([[25, 0, 0]])]
n_data_sets = 2
n_repetition = 3
Now, I want to merge (concatenate) the first array of arr_all (arr_all[0]) with the fourth one (arr_all[3]), the second (arr_all[1]) with the fifth (arr_all[4]), and the third (arr_all[2]) with the last one (arr_all[5]). In fact, here I have two data sets (n_data_sets=2) which are repeated three times (n_repetition=3). In reality I have several data sets that are repeated tens of times. I want to put each data set in a single array of my list. The input is sorted by repetition, but I want it sorted by the data sets of each repetition. My expected result is:
arr_all = [array([[1, 2, 10],
                  [5, 8, 3],
                  [9, 13, 0]]),
           array([[1, 0, 5],
                  [2, 10, 2],
                  [1.1, 3, 3]]),
           array([[0, 1, 8],
                  [25, 0, 0]])]
My input data was a list with six arrays (n_repetition times n_data_sets), but my result has n_repetition arrays.
I appreciate any feedback in advance.
To further Alexander's response, this is what I came up with:
import numpy as np
arr_all = [np.array([[1, 2, 10], [5, 8, 3]]),
           np.array([[1, 0, 5]]),
           np.array([[0, 1, 8]]),
           np.array([[9, 13, 0]]),
           np.array([[2, 10, 2], [1.1, 3, 3]]),
           np.array([[25, 0, 0]])]
n_data_sets = 2
n_repetition = 3
new_array = []
for i in range(n_repetition):
    dataset = arr_all[i]
    for j in range(n_data_sets - 1):
        dataset = np.concatenate([dataset, arr_all[i + (n_repetition * (j + 1))]])
    new_array.append(dataset)
print(new_array)
I also found a cleaner method, but which is possibly worse in terms of time:
import numpy as np
arr_all = [np.array([[1, 2, 10], [5, 8, 3]]),
           np.array([[1, 0, 5]]),
           np.array([[0, 1, 8]]),
           np.array([[9, 13, 0]]),
           np.array([[2, 10, 2], [1.1, 3, 3]]),
           np.array([[25, 0, 0]])]
n_data_sets = 2
n_repetition = 3
# dtype=object keeps the differently-shaped arrays as elements
# (required on recent NumPy, which rejects ragged inputs otherwise)
reshaped = np.reshape(np.array(arr_all, dtype=object), (n_repetition, n_data_sets), order='F')
new = []
for arr in reshaped:
    new.append(np.concatenate(arr))
print(new)
To always merge the first half with the second half (if this was your intention), you can do something like this (which will work if you have an even number of arrays):
import numpy as np
arr_all = [np.array([[1, 2, 10],
                     [5, 8, 3]]),
           np.array([[1, 0, 5]]),
           np.array([[0, 1, 8]]),
           np.array([[9, 13, 0]]),
           np.array([[2, 10, 2],
                     [1.1, 3, 3]]),
           np.array([[25, 0, 0]])]
half = len(arr_all) // 2
new = []
for i in range(half):
    new.append(np.concatenate((arr_all[i], arr_all[i + half]), axis=0))
print(new)
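Assuming the input really is ordered repetition-by-repetition as in the question, the same regrouping can also be written with a stride slice, arr_all[i::n_repetition] picking out the i-th group (a sketch, not from the original answers):

```python
import numpy as np

arr_all = [np.array([[1, 2, 10], [5, 8, 3]]),
           np.array([[1, 0, 5]]),
           np.array([[0, 1, 8]]),
           np.array([[9, 13, 0]]),
           np.array([[2, 10, 2], [1.1, 3, 3]]),
           np.array([[25, 0, 0]])]
n_repetition = 3

# group i collects arr_all[i], arr_all[i + n_repetition], ...
merged = [np.concatenate(arr_all[i::n_repetition], axis=0)
          for i in range(n_repetition)]
```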

find the index of largest, 2nd largest or third largest in a row of multidimensional array python

Let say I have the following two numpy arrays:
a = numpy.array([[1,4,6,2,5],[3,2,7,12,1],[8,5,3,1,4],[6,10,2,4,9]])
b = numpy.array([0, 1, 4])
Now, I want to search for the index of the max value in a specific row (say the second row, a[1, :]). I have another array b with some numbers in it; if the index of the max value is present as an element of b, I use that index for some further calculation. If the index of the max value from a row of a is not present in b, I look for the index of the 2nd largest number; if that index is in b I use it, otherwise I look for the index of the third largest, and so on. I do not want to sort the array a.
In the above example, the second row is [3, 2, 7, 12, 1]. The index of the max number is 3, but 3 is not present in b; the index of the 2nd largest is 2, which is also not present in b; so I look for the index of the third largest, which is 0, and that is present in b. I then assign 0 to a new variable. Any quick and fast way of doing it? Thanks in advance.
Here's one that scales well for generic ndarrays -
def maxindex(a, b, fillna=-1):
    sidx = a.argsort(-1)
    m = np.isin(sidx, b)
    idx = m.shape[-1] - m[..., ::-1].argmax(-1) - 1
    out = np.take_along_axis(sidx, idx[..., None], axis=-1).squeeze()
    return np.where(m.any(-1), out, fillna)
Sample runs -
In [83]: a
Out[83]:
array([[ 1,  4,  6,  2,  5],
       [ 3,  2,  7, 12,  1],
       [ 8,  5,  3,  1,  4],
       [ 6, 10,  2,  4,  9]])
In [84]: b
Out[84]: array([0, 1, 4])
In [85]: maxindex(a, b)    # all rows
Out[85]: array([4, 0, 0, 1])
In [86]: maxindex(a[1], b)  # second row
Out[86]: array([0])
3D case -
In [105]: a
Out[105]:
array([[[ 1,  4,  6,  2,  5],
        [ 3,  2,  7, 12,  1],
        [ 8,  5,  3,  1,  4],
        [ 6, 10,  2,  4,  9]],

       [[ 1,  4,  6,  2,  5],
        [ 3,  2,  7, 12,  1],
        [ 8,  5,  3,  1,  4],
        [ 6, 10,  2,  4,  9]]])
In [106]: maxindex(a, b)
Out[106]:
array([[4, 0, 0, 1],
       [4, 0, 0, 1]])
IIUC, if you need to keep the index from a:
res = b[a[:, b].argmax(1)]
array([4, 0, 0, 1])
Alternatively:
a[:, np.delete(np.arange(a.shape[1]), b)] = a.min()
res = a.argmax(1)
array([4, 0, 0, 1], dtype=int64)
If an index into the column subset a[:, b] is sufficient:
res = a[:, b].argmax(1)
array([2, 0, 0, 1], dtype=int64)
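For a single row, a literal (and much slower, Python-level) translation of the question's walk through the sorted order may be easier to follow; this helper is a sketch, not part of either answer above:

```python
import numpy as np

a = np.array([[1, 4, 6, 2, 5],
              [3, 2, 7, 12, 1],
              [8, 5, 3, 1, 4],
              [6, 10, 2, 4, 9]])
b = np.array([0, 1, 4])

def maxindex_row(row, b):
    # column indices ordered from largest to smallest value
    order = np.argsort(row)[::-1]
    for idx in order:
        if idx in b:   # first allowed index wins
            return idx
    return -1          # no index of this row appears in b

print(maxindex_row(a[1], b))  # -> 0, as worked out in the question
```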

Append columns to numpy array [duplicate]

This question already has answers here:
How do I add an extra column to a NumPy array?
(17 answers)
Closed 4 years ago.
If I set up a four column zeros array:
X_large = np.zeros((X.shape[0], 4))
And I have an array X as below:
X = np.array([
    [0, 1],
    [2, 2],
    [3, 4],
    [6, 5]
])
How can I get X_large to take X and have the last two columns show the first value of each row of the array squared and the second value of each row of the array cubed? Meaning:
X_large = [[0, 1,  0,   1],
           [2, 2,  4,   8],
           [3, 4,  9,  64],
           [6, 5, 36, 125]]
This is probably not too hard, but I'm a pretty novice programmer in general.
Thanks!
Calculate the power of X first, then do the column_stack:
np.column_stack((X, X ** [2, 3]))
#array([[  0,   1,   0,   1],
#       [  2,   2,   4,   8],
#       [  3,   4,   9,  64],
#       [  6,   5,  36, 125]])
Or use np.power for the power calculation:
np.column_stack((X, np.power(X, [2, 3])))
#array([[  0,   1,   0,   1],
#       [  2,   2,   4,   8],
#       [  3,   4,   9,  64],
#       [  6,   5,  36, 125]])
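If you specifically want to fill the preallocated X_large from the question rather than build a new array, a sketch (column assignments chosen to match the expected output):

```python
import numpy as np

X = np.array([[0, 1],
              [2, 2],
              [3, 4],
              [6, 5]])

X_large = np.zeros((X.shape[0], 4), dtype=X.dtype)
X_large[:, :2] = X             # copy the original two columns
X_large[:, 2] = X[:, 0] ** 2   # first value of each row, squared
X_large[:, 3] = X[:, 1] ** 3   # second value of each row, cubed
```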

Efficient way of making a list of pairs from an array in Numpy

I have a numpy array x (with shape (n, 4)) of integers like:
[[0 1 2 3],
 [1 2 7 9],
 [2 1 5 2],
 ...]
I want to transform the array into an array of pairs:
[0,1]
[0,2]
[0,3]
[1,2]
...
so the first element makes a pair with each of the other elements in the same sub-array. I already have a for-loop solution:
y = np.array([[x[j, 0], x[j, i]] for i in range(1, 4) for j in range(0, n)], dtype=int)
but since looping over a numpy array is not efficient, I tried slicing instead. I can do the slicing for every column as:
y[1] = np.array([x[:, 0], x[:, 1]]).T
# [[0, 1], [1, 2], [2, 1], ...]
I can repeat this for all columns. My questions are:
How can I append y[2] to y[1], ... such that the final shape is (N, 2)?
If the number of columns is not small (in this example it is 4), how can I build each y[i] elegantly?
What are the alternative ways to achieve the final array?
The cleanest way of doing this I can think of would be:
>>> x = np.arange(12).reshape(3, 4)
>>> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> n = x.shape[1] - 1
>>> y = np.repeat(x, (n,) + (1,) * n, axis=1)
>>> y
array([[ 0,  0,  0,  1,  2,  3],
       [ 4,  4,  4,  5,  6,  7],
       [ 8,  8,  8,  9, 10, 11]])
>>> y.reshape(-1, 2, n).transpose(0, 2, 1).reshape(-1, 2)
array([[ 0,  1],
       [ 0,  2],
       [ 0,  3],
       [ 4,  5],
       [ 4,  6],
       [ 4,  7],
       [ 8,  9],
       [ 8, 10],
       [ 8, 11]])
This will make two copies of the data, so it will not be the most efficient method. That would probably be something like:
>>> y = np.empty((x.shape[0], n, 2), dtype=x.dtype)
>>> y[..., 0] = x[:, 0, None]
>>> y[..., 1] = x[:, 1:]
>>> y.shape = (-1, 2)
>>> y
array([[ 0,  1],
       [ 0,  2],
       [ 0,  3],
       [ 4,  5],
       [ 4,  6],
       [ 4,  7],
       [ 8,  9],
       [ 8, 10],
       [ 8, 11]])
Like Jaime, I first tried a repeat of the 1st column followed by reshaping, but then decided it was simpler to make 2 intermediary arrays and hstack them:
x = np.array([[0, 1, 2, 3], [1, 2, 7, 9], [2, 1, 5, 2]])
m, n = x.shape
x1 = x[:, 0].repeat(n - 1)[:, None]
x2 = x[:, 1:].reshape(-1, 1)
np.hstack([x1, x2])
producing
array([[0, 1],
       [0, 2],
       [0, 3],
       [1, 2],
       [1, 7],
       [1, 9],
       [2, 1],
       [2, 5],
       [2, 2]])
There probably are other ways of doing this sort of rearrangement. The result will copy the original data one way or another. My guess is that as long as you are using compiled functions like reshape and repeat, the time differences won't be significant.
Suppose the numpy array is
arr = np.array([[0, 1, 2, 3],
                [1, 2, 7, 9],
                [2, 1, 5, 2]])
You can get the array of pairs as
import itertools
m, n = arr.shape
new_arr = np.array([x for i in range(m)
                    for x in itertools.product(arr[i, 0:1], arr[i, 1:n])])
The output would be
array([[0, 1],
       [0, 2],
       [0, 3],
       [1, 2],
       [1, 7],
       [1, 9],
       [2, 1],
       [2, 5],
       [2, 2]])
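One more vectorized variant (a sketch, not from the answers above): broadcast the first column against the remaining columns and stack along a new last axis, which avoids the double reshape of the repeat-based approaches:

```python
import numpy as np

x = np.array([[0, 1, 2, 3],
              [1, 2, 7, 9],
              [2, 1, 5, 2]])

# repeat the first column (as a view) to the shape of the remaining columns
first = np.broadcast_to(x[:, :1], x[:, 1:].shape)
# pair them up: shape (m, n-1, 2), then flatten to (m*(n-1), 2)
pairs = np.stack([first, x[:, 1:]], axis=-1).reshape(-1, 2)
```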
