I have a np.ndarray with numbers that indicate spots of interest, I am interested in the spots which have values 1 and 9.
Right now they are being extracted as such:
maskindex.append(np.where(extract.variables['mask'][0] == 1) or np.where(megadatalist[0].variables['mask'][0] == 9))
xval = maskindex[0][1]
yval = maskindex[0][0]
I need to apply these x and y values to the arrays that I am operating on, to speed things up.
I have 140 arrays that are each 734 x 1468, and I need the mean, max, min, and std calculated for each field. I was hoping there was an easy way to apply the masked array to speed up the operations; right now I am simply doing it on the entire arrays, as such:
Average_List = np.mean([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Average_Error_List = np.mean([megadatalist[i].variables['analysis_error'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Std_List = np.std([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Maximum_List = np.maximum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Minimum_List = np.minimum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Any ideas on how to speed things up would be highly appreciated.
I may have solved it partially, depending on what you're aiming for. The following code reduces an array arr to a 1D array with only the relevant indices. You can then do the needed calculations without considering the unwanted locations:
arr = np.array([[0,9,9,0,0,9,9,1],[9,0,1,9,0,0,0,1]])
target = [1,9] # wanted values
index = np.where(np.in1d(arr.ravel(), target).reshape(arr.shape))
no_zeros = arr[index]
At this stage "all you need" is to reinsert the values "no_zeros" into an array of zeros with the appropriate shape, at the indices given in "index". One way is to flatten the index array and recalculate the indices so that they match a flattened arr array, then use numpy.insert(np.zeros(arr.shape), new_index, no_zeros) and reshape to the appropriate shape afterwards. Reshaping is constant time in numpy. Admittedly, I have not figured out a fast numpy way to create the new_index array.
Hope it helps.
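For the original question, here is a minimal hedged sketch of how the same idea could be applied end to end. Dummy arrays stand in for the netCDF variables, so mask_field plays the role of extract.variables['mask'][0] and fields plays the role of the analysed_sst slices from megadatalist; np.isin is the array-shaped counterpart of np.in1d used above:

import numpy as np

# dummy stand-ins for the netCDF variables in the question
n_datasets = 5
mask_field = np.random.choice([0, 1, 2, 9], size=(734, 1468))    # the 'mask' variable
fields = [np.random.rand(734, 1468) for _ in range(n_datasets)]  # the 'analysed_sst' slices

spots = np.isin(mask_field, [1, 9])                 # boolean mask of the wanted spots
# stack only the selected points: shape (n_datasets, n_spots)
stacked = np.stack([f[spots] for f in fields])

average = stacked.mean(axis=0)                      # per-spot mean across datasets
std = stacked.std(axis=0)
maximum = stacked.max(axis=0)
minimum = stacked.min(axis=0)

Because only the selected points are stacked, the statistics run over an (n_datasets, n_spots) array instead of the full 734 x 1468 grids.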
I'm basically trying to take the weighted mean of a 3D dataset, but only on a filtered subset of the data, where the filter is based off of another (2D) array. The shape of the 2D data matches the first 2 dimensions of the 3D data, and is thus repeated for each slice in the 3rd dimension.
Something like:
import numpy as np
myarr = np.array([[[4,6,8],[9,3,2]],[[2,7,4],[3,8,6]],[[1,6,7],[7,8,3]]])
myarr2 = np.array([[7,3],[6,7],[2,6]])
weights = np.random.rand(3,2,3)
filtered = []
for k in range(len(myarr[0,0,:])):
    temp1 = myarr[:,:,k]
    temp2 = weights[:,:,k]
    filtered.append(temp1[np.where(myarr2 > 5)]*temp2[np.where(myarr2 > 5)])
average = np.array(np.sum(filtered,1)/len(filtered[0]))
I am concerned about efficiency here. Is it possible to vectorize this so I don't need the loop, or are there other suggestions to make this more efficient?
The most glaring efficiency issue, even leaving the loop aside, is that np.where(...) is being called multiple times inside the loop on the same condition! You can just do this a single time beforehand. Moreover, there is no need for a loop at all. Your operation basically equates to:
mask = myarr2 > 5
average = (myarr[mask] * weights[mask]).mean(axis=0)
There is no need for an np.where either.
myarr2 is an array of shape (i, j) with the same first two dims as myarr and weights, which have some shape (i, j, k).
So if there are n True elements in the boolean mask myarr2 > 5, applying it to the other arrays yields (n, k) elements: for every [i, j] position where the mask is True, all elements along the third axis are taken.
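To make the equivalence concrete, here is a quick check on the sample arrays from the question, comparing the loop result with the vectorized one (same variable names as above):

import numpy as np

myarr = np.array([[[4,6,8],[9,3,2]],[[2,7,4],[3,8,6]],[[1,6,7],[7,8,3]]])
myarr2 = np.array([[7,3],[6,7],[2,6]])
weights = np.random.rand(3, 2, 3)

# loop version from the question
filtered = []
for k in range(len(myarr[0, 0, :])):
    temp1 = myarr[:, :, k]
    temp2 = weights[:, :, k]
    filtered.append(temp1[np.where(myarr2 > 5)] * temp2[np.where(myarr2 > 5)])
average_loop = np.array(np.sum(filtered, 1) / len(filtered[0]))

# vectorized version
mask = myarr2 > 5
average_vec = (myarr[mask] * weights[mask]).mean(axis=0)

print(np.allclose(average_loop, average_vec))    # True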
I have a 3-dimensional array of shape (365, x, y), where 365 corresponds to daily data. In some cases, all the elements along the time axis (axis=0) are np.nan.
The time series for each point along the axis=0 looks something like this:
I need to find the index at which the maximum value (peak data) occurs and then the two minimum values on each side of the peak.
import numpy as np
a = np.random.random((365, 3, 3)) * 10
a[:, 0, 0] = np.nan
peak_mask = np.ma.masked_array(a, np.isnan(a))
peak_indexes = np.nanargmax(peak_mask, axis=0)
I can find the minimum before the peak using something like this:
early_minimum_indexes = np.full_like(peak_indexes, fill_value=0)
for i in range(peak_indexes.shape[0]):
    for j in range(peak_indexes.shape[1]):
        if peak_indexes[i, j] == 0:
            early_minimum_indexes[i, j] = 0
        else:
            early_mask = np.ma.masked_array(a, np.isnan(a))
            early_loc = np.nanargmin(early_mask[:peak_indexes[i, j], i, j], axis=0)
            early_minimum_indexes[i, j] = early_loc
With the resulting peak and trough plotted like this:
This approach is very unreasonable time-wise for large arrays (1m+ elements). Is there a better way to do this using numpy?
While using masked arrays may not be the most efficient solution in this case, it will allow you to perform masked operations on specific axes while more-or-less preserving shape, which is a great convenience. Keep in mind that in many cases, the masked functions will still end up copying the masked data.
You have mostly the right idea in your current code, but you missed a couple of tricks, like being able to negate and combine masks, the fact that allocating masks as boolean up front is more efficient, and little nitpicks like np.full(..., 0) -> np.zeros(..., dtype=bool).
Let's work through this backwards. Let's say you had a well-behaved 1-D array with a peak, say a1. You can use masking to easily find the maxima and minima (or indices) like this:
peak_index = np.nanargmax(a1)
mask = np.zeros(a1.size, dtype=bool)
mask[peak_index:] = True
trough_plus = np.nanargmin(np.ma.array(a1, mask=~mask))
trough_minus = np.nanargmin(np.ma.array(a1, mask=mask))
This respects the fact that masked arrays flip the sense of the mask relative to normal numpy boolean indexing. It's also OK that the maximum value appears in the calculation of trough_plus, since it's guaranteed not to be a minimum (unless you have the all-nan situation).
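As a concrete toy example of this well-behaved 1-D case (made-up numbers, no NaNs, so the masked array's own argmin is enough here):

import numpy as np

a1 = np.array([3.0, 1.0, 2.0, 9.0, 4.0, 0.5, 2.5])   # single peak at index 3

peak_index = np.nanargmax(a1)                          # 3
mask = np.zeros(a1.size, dtype=bool)
mask[peak_index:] = True

trough_plus = np.ma.array(a1, mask=~mask).argmin()     # 5 (right-hand minimum, 0.5)
trough_minus = np.ma.array(a1, mask=mask).argmin()     # 1 (left-hand minimum, 1.0)
print(peak_index, trough_minus, trough_plus)           # 3 1 5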
Now if a1 was a masked array already (but still 1D), you could do the same thing, but combine the masks temporarily. For example:
a1 = np.ma.array(a1, mask=np.isnan(a1))
peak_index = a1.argmax()
mask = np.zeros(a1.size, dtype=bool)
mask[peak_index:] = True
trough_plus = np.ma.masked_array(a1, mask=a1.mask | ~mask).argmin()
trough_minus = np.ma.masked_array(a1, mask=a1.mask | mask).argmin()
Again, since masked arrays have reversed masks, it's important to combine the masks using | instead of &, as you would for normal numpy boolean masks. In this case, there is no need for calling the nan version of argmax and argmin, since all the nans are already masked out.
Hopefully, the generalization to multiple dimensions becomes clear from here, given the prevalence of the axis keyword in numpy functions:
a = np.ma.array(a, mask=np.isnan(a))
peak_indices = a.argmax(axis=0).reshape(1, *a.shape[1:])
mask = np.arange(a.shape[0]).reshape(-1, *(1,) * (a.ndim - 1)) >= peak_indices
trough_plus = np.ma.masked_array(a, mask=~mask | a.mask).argmin(axis=0)
trough_minus = np.ma.masked_array(a, mask=mask | a.mask).argmin(axis=0)
The N-dimensional masking technique comes from Fill mask efficiently based on start indices, which was asked just for this purpose.
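As a quick sanity check of the vectorized version, you can compare one pixel against the straightforward 1-D logic on a small random cube (the shapes and the chosen pixel are arbitrary, just for testing):

import numpy as np

rng = np.random.default_rng(0)
a_raw = rng.random((50, 4, 5)) * 10
a_raw[:, 0, 0] = np.nan                      # one all-NaN pixel

a = np.ma.array(a_raw, mask=np.isnan(a_raw))
peak_indices = a.argmax(axis=0).reshape(1, *a.shape[1:])
mask = np.arange(a.shape[0]).reshape(-1, *(1,) * (a.ndim - 1)) >= peak_indices
trough_plus = np.ma.masked_array(a, mask=~mask | a.mask).argmin(axis=0)
trough_minus = np.ma.masked_array(a, mask=mask | a.mask).argmin(axis=0)

# compare an arbitrary valid pixel with the plain 1-D computation
i, j = 1, 2
col = a_raw[:, i, j]
pk = np.nanargmax(col)
print(pk == peak_indices[0, i, j])
print(np.nanargmin(col[pk:]) + pk == trough_plus[i, j])
if pk > 0:
    print(np.nanargmin(col[:pk]) == trough_minus[i, j])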
Here is a method that:

- copies the data
- saves all nan positions and replaces all nans with the global min - 1
- finds the rowwise argmax
- subtracts its value from the entire row
  - note that each row now has only non-positive values, with the max value now being zero
- zeros all nan positions
- flips the sign of all values right of the max
  - this is the main idea; it creates a new row-global max at the position where before there was the right-hand min; at the same time it ensures that the left-hand min is now row-global
- retrieves the rowwise argmin and argmax; these are the positions of the left and right mins in the original array
- finds all-nan rows and overwrites the max and min indices at these positions with INVALINT
Code:
INVALINT = -9999
t,x,y = a.shape
t,x,y = np.ogrid[:t,:x,:y]
inval = np.isnan(a)
b = np.where(inval,np.nanmin(a)-1,a)
pk = b.argmax(axis=0)
pkval = b[pk,x,y]
b -= pkval
b[inval] = 0
b[t>pk[None]] *= -1
ltr = b.argmin(axis=0)
rtr = b.argmax(axis=0)
del b
inval = inval.all(axis=0)
pk[inval] = INVALINT
ltr[inval] = INVALINT
rtr[inval] = INVALINT
# result is now in ltr ("left trough"), pk ("peak") and rtr
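For convenience, the same steps can be wrapped in a small helper (peak_and_troughs is just a hypothetical name; the body is the code above, unchanged apart from comments) and tried on random data with one all-NaN pixel:

import numpy as np

INVALINT = -9999

def peak_and_troughs(a):
    # returns (left trough, peak, right trough) indices along axis 0
    t, x, y = a.shape
    t, x, y = np.ogrid[:t, :x, :y]
    inval = np.isnan(a)
    b = np.where(inval, np.nanmin(a) - 1, a)
    pk = b.argmax(axis=0)
    b -= b[pk, x, y]            # shift each column so its max becomes zero
    b[inval] = 0                # neutralize the nan positions
    b[t > pk[None]] *= -1       # flip the sign right of the peak
    ltr = b.argmin(axis=0)      # left-hand minimum of the original data
    rtr = b.argmax(axis=0)      # right-hand minimum of the original data
    allnan = inval.all(axis=0)
    pk[allnan] = ltr[allnan] = rtr[allnan] = INVALINT
    return ltr, pk, rtr

a = np.random.random((365, 3, 3)) * 10
a[:, 0, 0] = np.nan
ltr, pk, rtr = peak_and_troughs(a)
print(pk[0, 0], ltr[0, 0], rtr[0, 0])    # -9999 -9999 -9999 for the all-NaN pixel
print(pk[1, 1], ltr[1, 1], rtr[1, 1])    # peak index and its two flanking minima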
I have a MxN array of values taken from an experiment. Some of these values are invalid and are set to 0 to indicate such. I can construct a mask of valid/invalid values using
mask = (mat1 == 0) & (mat2 == 0)
which produces an MxN array of bool. It should be noted that the masked locations do not neatly follow columns or rows of the matrix - so simply cropping the matrix is not an option.
Now, I want to take the mean along one axis of my array (e.g. end up with a 1xN array) while excluding those invalid values from the mean calculation. Intuitively I thought
np.mean(mat1[mask],axis=1)
should do it, but the mat1[mask] operation produces a 1D array which appears to just be the elements where mask is true - which doesn't help when I only want a mean across one dimension of the array.
Is there a 'python-esque' or numpy way to do this? I suppose I could use the mask to set masked elements to NaN and use np.nanmean - but that still feels kind of clunky. Is there a way to do this 'cleanly'?
I think the best way to do this would be something along the lines of:
masked = np.ma.masked_where((mat1 == 0) & (mat2 == 0), array_to_mask)
Then take the mean with
masked.mean(axis=1)
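For example, with two small dummy matrices (mat1 and mat2 as in the question, and data standing in for array_to_mask, i.e. the array you actually want to average):

import numpy as np

mat1 = np.array([[0, 2, 3], [4, 0, 6]])
mat2 = np.array([[0, 5, 1], [2, 0, 3]])
data = np.array([[10., 20., 30.], [40., 50., 60.]])

# mask positions that are zero in both matrices
masked = np.ma.masked_where((mat1 == 0) & (mat2 == 0), data)
print(masked.mean(axis=1))    # [25.0 50.0] -- the masked entries are excluded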
One similarly clunky but efficient way is to multiply your array with the mask, setting the masked values to zero. Then of course you'll have to divide by the number of non-masked values manually, hence the clunkiness. But this will work with integer-valued arrays, something that can't be said about the NaN approach. It also seems to be the fastest for both small and larger arrays (including the masked-array solution in the other answer):
import numpy as np
def nanny(mat, mask):
    mat = mat.astype(float).copy()    # don't mutate the original
    mat[~mask] = np.nan               # mask values
    return np.nanmean(mat, axis=0)    # compute mean

def manual(mat, mask):
    # zero masked values, divide by number of nonzeros
    return (mat*mask).sum(axis=0)/mask.sum(axis=0)
# set up dummy data for testing
N,M = 400,400
mat1 = np.random.randint(0,N,(N,M))
mask = np.random.randint(0,2,(N,M)).astype(bool)
print(np.array_equal(nanny(mat1, mask), manual(mat1, mask))) # True
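If you want to reproduce the timing claim yourself, a rough harness with timeit, reusing the nanny, manual, mat1 and mask defined just above, might look like this:

from timeit import timeit

print(timeit(lambda: nanny(mat1, mask), number=100))     # NaN-based mean
print(timeit(lambda: manual(mat1, mask), number=100))    # multiply-and-divide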
I have a "cube" of 3D data where there is some peak in the column, or first dimension. The index of the peak may shift depending what row is examined. The third dimension may do something a bit more complicated, but for now can be thought of as just scaling things by some linear function.
I would like to find the index of the max along the first dimension, subject to the constraint that for each row, the z index is chosen such that the column peak will be closest to 0.5.
Here's a sample image that is a plane in row,column with a fixed z:
These arrays will at times be large -- say, 21x11x200 float64s, so I would like to vectorize this calculation. Written with a for loop, it looks like this:
cols, rows, zs = data.shape
for i in range(rows):
    # for each field point, make an intermediate array that is 2D with focus,frequency dimensions
    arr = data[:,i,:]
    # compute the thru-focus max and find the peak closest to 0.5
    maxs = np.max(arr, axis=0)
    max_manip = np.abs(maxs-0.5)
    freq_idx = np.argmin(max_manip)
    # take the thru-focus slice that peaks closest to 0.5
    arr2 = data[:,i,freq_idx]
    focus_idx = np.argmax(arr2)
    print(focus_idx)
My issue is that I do not know how to roll these calculations up into a vector operation. I would appreciate any help, thanks!
We just need to use the axis parameter with the relevant reductions, and that leads us to a vectorized solution, like so:
# Get freq indices along all rows in one go
idx = np.abs(data.max(0)-0.5).argmin(1)
# Index into data with those and get the argmax indices
out = data[:,np.arange(data.shape[1]), idx].argmax(0)
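As a quick check that the two-liner matches the loop from the question, on a random cube of the stated size:

import numpy as np

data = np.random.rand(21, 11, 200)

# loop version from the question
loop_out = []
for i in range(data.shape[1]):
    maxs = np.max(data[:, i, :], axis=0)
    freq_idx = np.argmin(np.abs(maxs - 0.5))
    loop_out.append(np.argmax(data[:, i, freq_idx]))

# vectorized version
idx = np.abs(data.max(0) - 0.5).argmin(1)
out = data[:, np.arange(data.shape[1]), idx].argmax(0)

print(np.array_equal(out, np.array(loop_out)))    # True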
I am concatenating data to a numpy array like this:
xdata_test = np.concatenate((xdata_test,additional_X))
This is done a thousand times. The arrays have dtype float32, and their sizes are shown below:
xdata_test.shape : (x1,40,24,24) (x1 : [500~10500])
additional_X.shape : (x2,40,24,24) (x2 : [0 ~ 500])
The problem is that when x1 is larger than ~2000-3000, the concatenation takes a lot longer.
The graph below plots the concatenation time versus the size of the x2 dimension:
Is this a memory issue or a basic characteristic of numpy?
As far as I understand numpy, the stack and concatenate functions are not extremely efficient, and for good reason: numpy tries to keep array memory contiguous for efficiency (see this link about contiguous arrays in numpy).
That means that every concatenate operation has to copy the whole data every time. When I need to concatenate a bunch of elements together, I tend to do this:
l = []
for additional_X in ...:
    l.append(additional_X)
xdata_test = np.concatenate(l)
That way, the costly operation of moving the whole data is only done once.
NB: I would be interested in the speed improvement this gives you.
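For reference, a rough benchmark sketch of the two strategies on made-up float32 chunks of roughly the shapes in the question (the chunk sizes and counts are arbitrary):

import numpy as np
from timeit import timeit

chunks = [np.random.rand(20, 40, 24, 24).astype(np.float32) for _ in range(20)]

def repeated_concatenate():
    out = chunks[0]
    for c in chunks[1:]:
        out = np.concatenate((out, c))    # copies the growing array every iteration
    return out

def concatenate_once():
    return np.concatenate(chunks)         # one allocation, one pass of copying

print(timeit(repeated_concatenate, number=10))
print(timeit(concatenate_once, number=10))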
If you have in advance the arrays you want to concatenate, I would suggest creating a new array with the total shape and fill it with the small arrays rather than concatenating, as every concatenation operation needs to copy the whole data to a new contiguous space of memory.
First, calculate the total size of the first axis:
max_x = 0
for arr in list_of_arrays:
    max_x += arr.shape[0]
Second, create the end container:
final_data = np.empty((max_x,) + xdata_test.shape[1:], dtype=xdata_test.dtype)
which for your data is equivalent to shape (max_x, 40, 24, 24), but with the trailing shape and the dtype taken dynamically from xdata_test.
Last, fill the numpy array:
curr_x = 0
for arr in list_of_arrays:
    final_data[curr_x:curr_x+arr.shape[0]] = arr
    curr_x += arr.shape[0]
The loop above copies each of the arrays into its pre-assigned block of rows in the larger array.
By doing this, each of the N arrays is copied exactly once, to its final destination, rather than creating temporary arrays for each concatenation.
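Putting the three steps together as a small helper (preallocate_and_fill is just a hypothetical name) and checking it against a single np.concatenate call on dummy data:

import numpy as np

def preallocate_and_fill(list_of_arrays):
    # total length of the first axis
    max_x = sum(arr.shape[0] for arr in list_of_arrays)
    # allocate the final container once, reusing the shape and dtype of the parts
    first = list_of_arrays[0]
    final_data = np.empty((max_x,) + first.shape[1:], dtype=first.dtype)
    # copy each array into its pre-assigned block of rows
    curr_x = 0
    for arr in list_of_arrays:
        final_data[curr_x:curr_x + arr.shape[0]] = arr
        curr_x += arr.shape[0]
    return final_data

parts = [np.random.rand(n, 40, 24, 24).astype(np.float32) for n in (100, 250, 60)]
result = preallocate_and_fill(parts)
print(result.shape)                                      # (410, 40, 24, 24)
print(np.array_equal(result, np.concatenate(parts)))     # True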