Select rows from array that are greater than template - python

I have a 2D NumPy array of float values, a, with shape (10^6, 3).
I want to know which rows are greater (element-wise) than np.array([25.0, 25.0, 25.0]), and then output the rows that satisfy this condition.
My code is as follows:
# Create an empty array
a_cut = np.empty(shape=(0, 3), dtype=float)
minimum = np.array([25.0, 25.0, 25.0])
for i in range(len(a)):
    if a[i, :].all() > minimum.all():
        a_cut = np.append(a_cut, a[i, :], axis=0)
However, the code is very inefficient: after a few hours the result still had not come out.
Is there a way to improve the speed of this loop?

np.append re-allocates the entire array every time you call it. It is basically the same as np.concatenate: use it very sparingly. The goal is to perform the entire operation in bulk.
You can construct a mask:
mask = (a > minimum).all(axis=1)
Then select:
a_cut = a[mask, :]
You may get a slight improvement from using indices instead of a boolean mask:
a_cut = a[np.flatnonzero(mask), :]
Indexing with fewer indices than there are dimensions applies the indices to the leading dimensions, so you can do
a_cut = a[mask]
The one liner is therefore:
a_cut = a[(a > minimum).all(1)]
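For illustration, here is a minimal self-contained sketch (the array size and contents are placeholders, not the real data) checking the vectorized filter against the naive loop:
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 50.0, size=(1000, 3))   # small stand-in for the (10^6, 3) array
minimum = np.array([25.0, 25.0, 25.0])

# Vectorized: keep rows where every element exceeds the threshold
a_cut = a[(a > minimum).all(axis=1)]

# Equivalent (slow) loop, shown only to verify the result
expected = np.array([row for row in a if (row > minimum).all()])
assert np.array_equal(a_cut, expected)
The vectorized version performs the comparison and the row selection as single bulk operations, so it scales to the (10^6, 3) input without trouble.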

Related

Call numpy ravel and include an integer multi-index

I have a 4-dimensional array. I am going to turn it into a 1-dim array. I use numpy ravel and it works fine with the default parameters.
However, I would also like the positions/indices in the 4-dim array.
I want something like this as a row in my output:
x,y,z,w,value
with x being the first dimension of my initial array, and so on.
The obvious approach is iteration; however, I was told to avoid it when I can.
for i in range(test.shape[0]):
    for j in range(test.shape[1]):
        for k in range(test.shape[2]):
            for l in range(test.shape[3]):
                print(i, j, k, l, test[(i, j, k, l)])
This will be too slow when I use a larger dataset.
Is there a way to configure ravel to do this, or any other approach faster than iteration?
Use np.indices with sparse=False, combined with np.concatenate to build the array. np.indices provides the first n columns, and np.concatenate appends the last one:
test = np.random.randint(10, size=(3, 5, 4, 2))
index = np.indices(test.shape, sparse=False) # shape: (4, 3, 5, 4, 2)
data = np.concatenate((index, test[None, ...]), axis=0).reshape(test.ndim + 1, -1).T
A more detailed breakdown:
index is a (4, *test.shape) array, with one element per dimension.
To make test concatenatable with index, you need to prepend a unit dimension, which is what test[None, ...] does. None is synonymous with np.newaxis, and Ellipsis, or ..., means "all the remaining dimensions".
When you concatenate along axis=0, you are appending test to the array of indices. For each position in test, the five values along the first axis are now the four indices followed by the value. The remaining axes still reflect the shape of test, but besides that, you have what you want.
The goal is to flatten out the trailing dimensions, so you get a (5, N) array, where N = np.prod(test.shape). That's what the final reshape does; the trailing .T then turns it into one (x, y, z, w, value) row per element. test.ndim + 1 is the size of the index plus 1 for the value. -1 can appear exactly once in a reshape; it means "the product of all the remaining dimensions".
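As a quick sanity check, a minimal sketch (variable names follow the code above) confirming that each output row is the multi-index followed by the corresponding value:
import numpy as np

test = np.random.randint(10, size=(3, 5, 4, 2))
index = np.indices(test.shape, sparse=False)
data = np.concatenate((index, test[None, ...]), axis=0).reshape(test.ndim + 1, -1).T

# Every row should be (x, y, z, w, value) with value == test[x, y, z, w]
for x, y, z, w, value in data[:5]:
    assert test[x, y, z, w] == value
print(data.shape)  # (120, 5) for this test shape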

Remove row of numpy array based on values from two columns

I have a numpy array with four columns and many rows:
>>> dat
array([['4/5/2004', '17', 0.0, 0.0],
['4/5/2004', '7', 0.0, 0.0],
['4/5/2004', '19:48:20', 58.432488, -135.9202205],
['4/5/2004', '19:48:32', 58.432524300000004, 0.0],
['4/5/2004', '19:48:36', 58.4325365, -150.9202813]], dtype=object)
I would like to remove all rows where the value in column 3 or column 4 equals 0, so the result would be:
([['4/5/2004', '19:48:20', 58.432488, -135.9202205],
['4/5/2004', '19:48:36', 58.4325365, -150.9202813]])
I can do this one column at a time with:
a = dat[~(dat[:,2]==0), :]
This returns the rows where the value in column 3 does not equal 0. I could do this iteratively for multiple columns, but it would be convenient to do it all in one command.
I thought something like the following two examples would work (but they do not):
a = dat[~(dat[:,2]==0), :] or dat[~(dat[:,3]==0), :]
a = dat[~(dat[:,2&3]==0), :]
Hopefully there's some simple syntax I'm missing and can't find in the numpy help.
Assuming the data array is 2D, we could slice and look for the valid ones -
dat[~(dat[:,2:4]==0).any(1)]
Alternatively, we can use np.all on the !=0 ones -
dat[(dat[:,2:4]!=0).all(1)]
When the columns of interest are not contiguous ones, we need to slice them using those column IDs and use the same technique. So, let's say the column IDs to be examined are stored in an array or list named colID, then we would have the approaches modified, like so -
dat[~(dat[:,colID]==0).any(1)]
dat[(dat[:,colID]!=0).all(1)]
Thus, for the stated case of columns 3 and 4, we would have: colID = [2, 3].
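To make that concrete, a small sketch rebuilding the dat array from the question and showing that colID = [2, 3] gives the same result as the contiguous slice (colID works the same way when the columns are not adjacent):
import numpy as np

dat = np.array([['4/5/2004', '17', 0.0, 0.0],
                ['4/5/2004', '7', 0.0, 0.0],
                ['4/5/2004', '19:48:20', 58.432488, -135.9202205],
                ['4/5/2004', '19:48:32', 58.432524300000004, 0.0],
                ['4/5/2004', '19:48:36', 58.4325365, -150.9202813]], dtype=object)

colID = [2, 3]
out1 = dat[~(dat[:, colID] == 0).any(1)]
out2 = dat[(dat[:, 2:4] != 0).all(1)]
assert (out1 == out2).all()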
What about using &:
>>> dat[(dat[:,2] != 0) & (dat[:,3] != 0), :]
array([['4/5/2004', '19:48:20', 58.432488, -135.9202205],
['4/5/2004', '19:48:36', 58.4325365, -150.9202813]], dtype=object)
which yields the element-wise "and".
I've changed the comparison to != 0 and used &, which avoids the additional inversions with ~.
You had the right idea conceptually with using or. The main difference is that you want the logical or (|) or logical and (&) operators (just like you are already using logical not (~)).
This works because an operation like dat[:,3] == 0 creates an array of booleans of the same size as a column of dat. When this array is used as an index, numpy interprets it as a mask. Splitting off the mask array to highlight this concept:
mask = (dat[:, 2] != 0) & (dat[:, 3] != 0)
dat = dat[mask, :]
Another way to compute the mask would be as follows:
mask = np.logical_and.reduce(dat[:, 2:] != 0, axis=1)
np.logical_and.reduce shrinks the input array across the columns (axis=1) by applying np.logical_and (which, for boolean arrays, is what the & operator computes) to the rows, so you get True where all the elements of the selected portion of each row are True.
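A minimal sketch (continuing with the dat array from the question) confirming that the reduced mask and the explicit & form agree:
mask1 = (dat[:, 2] != 0) & (dat[:, 3] != 0)
mask2 = np.logical_and.reduce(dat[:, 2:] != 0, axis=1)
assert (mask1 == mask2).all()
dat_filtered = dat[mask2, :]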

How to logically combine integer indices in numpy?

Does anyone know how to combine integer indices in numpy? Specifically, I've got the results of a few np.wheres and I would like to extract the elements that are common between them.
For context, I am trying to populate a large 3d array with the number of elements that are between boundary values of each cell, i.e. I have records of individual events including their time, latitude and longitude. I want to grid this into a 3D frequency matrix, where the dimensions are time, lat and lon.
I could loop over the array elements doing an np.where(timeCondition & latCondition & lonCondition) for each cell and populating it with the length of the where result, but I figured this would be very inefficient, as you would have to repeat a lot of the wheres.
What would be better is to have a list of wheres for each of the cells in each dimension, and then loop through, logically combining them.
As @ali_m said, using bitwise and (&) should be much faster, but to answer your question:
Call ravel_multi_index() to convert the multi-dimensional indices into 1-dim indices.
Call intersect1d() to get the indices that appear under both conditions.
Call unravel_index() to convert the 1-dim indices back into multi-dimensional indices.
Here is the code:
import numpy as np
a = np.random.rand(10, 20, 30)
idx1 = np.where(a>0.2)
idx2 = np.where(a<0.4)
ridx1 = np.ravel_multi_index(idx1, a.shape)
ridx2 = np.ravel_multi_index(idx2, a.shape)
ridx = np.intersect1d(ridx1, ridx2)
idx = np.unravel_index(ridx, a.shape)
np.allclose(a[idx], a[(a>0.2) & (a<0.4)])
or you can use ridx directly:
a.ravel()[ridx]
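The same idea extends to more than two np.where results (the question mentions several conditions). A minimal sketch, continuing from the code above, with idx3 as an arbitrary extra condition added only for illustration, intersecting the raveled indices pairwise:
from functools import reduce

idx3 = np.where(a != 0.3)
raveled = [np.ravel_multi_index(i, a.shape) for i in (idx1, idx2, idx3)]
ridx_all = reduce(np.intersect1d, raveled)
idx_all = np.unravel_index(ridx_all, a.shape)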

Setting values in a numpy array indexed by a slice and two boolean arrays

I have two numpy arrays:
a = np.arange(100*100).reshape(100,100)
b = np.random.rand(100, 100)
I also have a tuple of slices to extract a certain piece of the array:
slice_ = (slice(5, 10), slice(5, 10))
I then have a set of boolean indices to select a certain part of this slice:
indices = b[slice_] > 0.5
If I want to set these indices to a different value I can do it easily:
a[slice_][indices] = 42
However, if I create another set of boolean indices to select a specific part of the indexed array:
high_indices = a[slice_][indices] > 700
and then try and set the value of the array at these indices:
a[slice_][indices][high_indices] = 42 # Doesn't do anything!
I thought maybe I needed to AND the two index arrays together, but they are different shape: indices has a shape of (5, 5) and high_indices has a shape of (12,).
I think I've got myself into a terrible muddle here trying to do something relatively simple. How can I index using these two boolean arrays in such a way that I can set the values of the array?
Slicing a numpy array returns a view, but boolean indexing returns a copy of the array. So when you index it the first time with a boolean index in a[slice_][indices][high_indices], you get back a copy, and the value 42 is assigned to that copy and not to the array a. You can solve the problem by combining the two conditions into a single boolean index:
a[slice_][(a[slice_] > 700) & (b[slice_] > 0.5)] = 42
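A quick hedged check (the arrays are built exactly as in the question, with a fixed seed purely so the run is reproducible) that the chained form leaves a untouched while the combined mask really does modify it:
import numpy as np

np.random.seed(0)
a = np.arange(100 * 100).reshape(100, 100)
b = np.random.rand(100, 100)
slice_ = (slice(5, 10), slice(5, 10))
before = a.copy()

indices = b[slice_] > 0.5
high_indices = a[slice_][indices] > 700
a[slice_][indices][high_indices] = 42                 # assigns into a temporary copy...
assert (a == before).all()                            # ...so a is unchanged

a[slice_][(a[slice_] > 700) & (b[slice_] > 0.5)] = 42
print((a != before).sum(), "elements were modified")  # non-zero for typical b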

Numpy signed maximum magnitude of cumsum along an axis

I have a numpy array a, a.shape=(17,90,144). I want to find the maximum magnitude of each column of cumsum(a, axis=0), but retaining the original sign. In other words, if for a given column a[:,j,i] the largest magnitude of cumsum corresponds to a negative value, I want to retain the minus sign.
The code np.amax(np.abs(a.cumsum(axis=0))) gets me the magnitude, but doesn't retain the sign. Using np.argmax instead will get me the indices I need, which I can then plug into the original cumsum array. But I can't find a good way to do so.
The following code works, but is dirty and really slow:
max_mag_signed = np.zeros((90, 144))
indices = np.argmax(np.abs(a.cumsum(axis=0)), axis=0)
for j in range(90):
    for i in range(144):
        max_mag_signed[j, i] = a.cumsum(axis=0)[indices[j, i], j, i]
There must be a cleaner, faster way to do this. Any ideas?
I can't find any alternative to argmax, but at least you can speed this up considerably with a more vectorized approach:
# store the cumsum, since it's used multiple times
cum_a = a.cumsum(axis=0)
# find the indices as before
indices = np.argmax(abs(cum_a), axis=0)
# construct the indices for the second and third dimensions
y, z = np.indices(indices.shape)
# get the values with np indexing
max_mag_signed = cum_a[indices, y, z]
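As a side note, a minimal sketch of an alternative way to pick out the signed values once indices is known, using np.take_along_axis (available in NumPy 1.15+), together with a check against the fancy-indexing version above (a here is just random data with the question's shape):
import numpy as np

a = np.random.randn(17, 90, 144)
cum_a = a.cumsum(axis=0)
indices = np.argmax(np.abs(cum_a), axis=0)

# take_along_axis expects the index array to have the same ndim as cum_a
alt = np.take_along_axis(cum_a, indices[None, ...], axis=0)[0]

y, z = np.indices(indices.shape)
assert np.array_equal(alt, cum_a[indices, y, z])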
