Faster deleting/masking of a few entries from a large index array - python

I have an index manipulation problem that I can solve, but only slowly. I'm looking for a way to speed it up.
I have a large (m, 2) float array, points (think: 2D point coordinates), and a large array idx of indices into points; a typical operation is to pick points out with points[idx]. From idx, I sometimes need to delete a few entries. I can do that with a boolean mask, but the operation is slow, presumably because the entire array is rewritten in memory. An alternative would be numpy's masked arrays: masking itself is fast, of course, but unfortunately masked arrays don't work as indices: the mask is simply ignored.
MWE:
import numpy
# setup
points = numpy.random.rand(10, 2)
n = 5 # can be very large irl
idx = numpy.random.randint(0, 10, n)
# typical operation with idx:
# points[idx]
# a few entries are deleted
mask = numpy.ones(n, dtype=bool)
mask[2] = False # only a few are masked
idx = idx[mask] # takes a while
# alternative: use ma?
idx = numpy.random.randint(0, 10, n)
idx = numpy.ma.array(idx)
idx[2] = numpy.ma.masked
# Doesn't work, masking is ignored:
points[idx]
Any hints on how to speed this up?
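One pattern that can help, assuming the order of idx does not matter (my suggestion, not from the thread): instead of rewriting the array, overwrite each deleted slot with the current last entry and shrink a view, which is O(1) per deletion:
import numpy
points = numpy.random.rand(10, 2)
n = 5
idx = numpy.random.randint(0, 10, n)
# delete entry 2: move the last entry into its slot and shrink the view;
# note that this does not preserve the order of idx
idx[2] = idx[n - 1]
n -= 1
points[idx[:n]]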

Related

Append Value to 3D array numpy

I iterate over a 3D numpy array and, at every step, want to append a float value to the array in the 3rd dimension (axis=2).
Something like this (I know the code doesn't work as written; latIndex, lonIndex, and data are generated randomly here for simplicity):
import numpy as np
import random
GridData = np.ones((121, 201, 1000))
data = np.random.rand(4800, 4800)
for row in range(4800):
    for column in range(4800):
        latIndex = random.randrange(0, 121, 1)
        lonIndex = random.randrange(0, 201, 1)
        GridData = np.append(GridData[latIndex, lonIndex, :], data[column, row], axis=2)
The size of the 3rd dimension of GridData (1000 in this example) is arbitrary.
How can I achieve this?
Addition:
It might be possible without np.append, but I don't know how, since the 3rd index is different for every combination of latIndex and lonIndex.
You can allocate extra space for your array grid_data, fill it with NaN, and keep track of the next index to be filled in a separate array while iterating through data and copying its values over. If you completely fill the third dimension for some lat_idx, lon_idx with non-NaN values, you just allocate more space. Since appending is expensive with numpy, it's best to make this extra space fairly large so you only reallocate once or twice (below I allocate twice the original space at a time).
Once the array is filled, you can remove the added space that went unused with the help of numpy.isnan(). This solution does what you want, but it is very slow (about two minutes for the example sizes you gave); the slowness comes from the Python-level iteration, though, not from the numpy operations.
Here's the code:
import random
import numpy as np
grid_data = np.ones(shape=(121, 201, 1000))
data = np.random.rand(4800, 4800)
# keep track of the next index to fill for each array along axis 2
next_to_fill = np.full(shape=(grid_data.shape[0], grid_data.shape[1]),
                       fill_value=grid_data.shape[2],
                       dtype=np.int32)
# allocate more space
double_shape = (grid_data.shape[0], grid_data.shape[1], grid_data.shape[2] * 2)
extra_space = np.full(shape=double_shape, fill_value=np.nan)
grid_data = np.append(grid_data, extra_space, axis=2)
for row in range(4800):
    for col in range(4800):
        lat_idx = random.randint(0, 120)
        lon_idx = random.randint(0, 200)
        # allocate more space if needed
        if next_to_fill[lat_idx, lon_idx] >= grid_data.shape[2]:
            grid_data = np.append(grid_data, extra_space, axis=2)
        grid_data[lat_idx, lon_idx, next_to_fill[lat_idx, lon_idx]] = data[row, col]
        next_to_fill[lat_idx, lon_idx] += 1
# remove unnecessary nans that were appended
not_all_nan_idxs = ~np.isnan(grid_data).all(axis=(0, 1))
grid_data = grid_data[:, :, not_all_nan_idxs]

Numpy dynamic array slicing based on min/max values

I have a 3-dimensional array of shape (365, x, y), where 365 corresponds to daily data. In some cases, all of the elements along the time axis (axis=0) are np.nan.
The time series for each point along axis=0 looks something like this (plot not included here).
I need to find the index at which the maximum value (peak data) occurs and then the two minimum values on each side of the peak.
import numpy as np
a = np.random.random((365, 3, 3)) * 10
a[:, 0, 0] = np.nan
peak_mask = np.ma.masked_array(a, np.isnan(a))
peak_indexes = np.nanargmax(peak_mask, axis=0)
I can find the minimum before the peak using something like this:
early_minimum_indexes = np.full_like(peak_indexes, fill_value=0)
for i in range(peak_indexes.shape[0]):
    for j in range(peak_indexes.shape[1]):
        if peak_indexes[i, j] == 0:
            early_minimum_indexes[i, j] = 0
        else:
            early_mask = np.ma.masked_array(a, np.isnan(a))
            early_loc = np.nanargmin(early_mask[:peak_indexes[i, j], i, j], axis=0)
            early_minimum_indexes[i, j] = early_loc
The resulting peak and troughs, when plotted, look correct (plot not included here).
This approach takes an unreasonably long time for large arrays (1M+ elements). Is there a better way to do this using numpy?
While masked arrays may not be the most efficient solution here, they do allow you to perform masked operations along specific axes while more or less preserving shape, which is a great convenience. Keep in mind that in many cases the masked functions will still end up copying the masked data.
You have mostly the right idea in your current code, but you missed a couple of tricks, like being able to negate and combine masks, the fact that allocating masks as boolean up front is more efficient, and little nitpicks like np.full(..., 0) -> np.zeros(..., dtype=bool).
Let's work through this backwards. Say you had a well-behaved 1-D array with a peak, call it a1. You can use masking to easily find the maxima and minima (or their indices) like this:
peak_index = np.nanargmax(a1)
mask = np.zeros(a1.size, dtype=bool)
mask[peak_index:] = True
trough_plus = np.nanargmin(np.ma.array(a1, mask=~mask))
trough_minus = np.nanargmin(np.ma.array(a1, mask=mask))
This respects the fact that masked arrays flip the sense of the mask relative to normal numpy boolean indexing. It's also OK that the maximum value appears in the calculation of trough_plus, since it's guaranteed not to be a minimum (unless you have the all-nan situation).
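A tiny demonstration of that flipped sense (my own snippet, not part of the answer):
import numpy as np
a1 = np.array([1.0, 5.0, 2.0])
m = np.array([True, False, True])
a1[m]                   # boolean indexing keeps the True positions: [1., 2.]
np.ma.array(a1, mask=m) # a masked array hides the True positions: [--, 5.0, --]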
Now if a1 was a masked array already (but still 1D), you could do the same thing, but combine the masks temporarily. For example:
a1 = np.ma.array(a1, mask=np.isnan(a1))
peak_index = a1.argmax()
mask = np.zeros(a1.size, dtype=bool)
mask[peak_index:] = True
trough_plus = np.ma.masked_array(a1, mask=a1.mask | ~mask).argmin()
trough_minus = np.ma.masked_array(a1, mask=a1.mask | mask).argmin()
Again, since masked arrays have reversed masks, it's important to combine the masks using | instead of &, as you would for normal numpy boolean masks. In this case, there is no need for calling the nan version of argmax and argmin, since all the nans are already masked out.
Hopefully, the generalization to multiple dimensions becomes clear from here, given the prevalence of the axis keyword in numpy functions:
a = np.ma.array(a, mask=np.isnan(a))
peak_indices = a.argmax(axis=0).reshape(1, *a.shape[1:])
mask = np.arange(a.shape[0]).reshape(-1, *(1,) * (a.ndim - 1)) >= peak_indices
trough_plus = np.ma.masked_array(a, mask=~mask | a.mask).argmin(axis=0)
trough_minus = np.ma.masked_array(a, mask=mask | a.mask).argmin(axis=0)
The N-dimensional masking technique comes from Fill mask efficiently based on start indices, a question that was asked just for this purpose.
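As a toy check of the broadcasted mask construction above (my own example, with assumed small shapes):
import numpy as np
a = np.random.random((4, 2, 2))
peak_indices = a.argmax(axis=0).reshape(1, *a.shape[1:]) # shape (1, 2, 2)
ramp = np.arange(a.shape[0]).reshape(-1, 1, 1)           # shape (4, 1, 1)
mask = ramp >= peak_indices                              # (4, 2, 2): True from each column's peak onward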
Here is a method that:
- copies the data
- saves all NaN positions and replaces all NaNs with the global min minus 1
- finds the rowwise argmax
- subtracts its value from the entire row; note that each row now has only non-positive values, with the max value now being zero
- zeros all NaN positions
- flips the sign of all values right of the max; this is the main idea: it creates a new row-global max at the position where the right-hand min used to be, while at the same time ensuring that the left-hand min is now row-global
- retrieves the rowwise argmin and argmax; these are the positions of the left and right mins in the original array
- finds all-NaN rows and overwrites the max and min indices at these positions with INVALINT
Code:
import numpy as np
# `a` is the (365, x, y) array from the question
INVALINT = -9999
t, x, y = a.shape
t, x, y = np.ogrid[:t, :x, :y]
inval = np.isnan(a)
# replace NaNs by a value below the global minimum
b = np.where(inval, np.nanmin(a) - 1, a)
pk = b.argmax(axis=0)
pkval = b[pk, x, y]
b -= pkval             # rowwise max is now zero
b[inval] = 0
b[t > pk[None]] *= -1  # flip the sign right of the max
ltr = b.argmin(axis=0)
rtr = b.argmax(axis=0)
del b
inval = inval.all(axis=0)
pk[inval] = INVALINT
ltr[inval] = INVALINT
rtr[inval] = INVALINT
# result is now in ltr ("left trough"), pk ("peak") and rtr ("right trough")

Scipy: Sparse indicator matrix from array(s)

What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:
I = a[:,None]==b
The following is slow and still memory-inefficient during creation:
I = sparse.csr_matrix((a[:,None]==b), shape=(len(a),len(b)))
The following at least gives the rows and cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:
z = np.argwhere((a[:,None]==b))
Any ideas?
One way to do it would be to first identify all values that a and b have in common, using sets. This should work well if there are not many distinct values in a and b. One then only has to loop over these common values (stored in the variable values below) and use np.argwhere to identify the indices in a and b where each value occurs. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))
##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []
##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))
##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)
##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )
##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)
The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.
Old Answer:
I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here is some code that uses two 1D arrays of random integers to first generate the sparse matrix the way the OP posted and then uses a generator expression to test all elements for equality:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
## matrix generation using generator
data, rows, cols = zip(
    *((True, i, j) for i, A in enumerate(a) for j, B in enumerate(b) if A == B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))
##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0) ## --> True
I think there is no way around the double loop, and ideally this would be pushed into numpy itself, but at least with the generator the loops are somewhat optimised.
You could use numpy.isclose with a small tolerance:
np.isclose(a,b)
Or, if a and b are pandas objects, pandas.DataFrame.eq:
a.eq(b)
Note that both of these return a dense array of True/False values.
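Note too that for the outer comparison in the question, either of these would need explicit broadcasting and would still produce a dense result; my adaptation of the isclose variant:
import numpy as np
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
I_dense = np.isclose(a[:, None], b)  # dense (400, 300) boolean array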

Efficient way to perform a 2D x 1D Matrix Multiply

I am trying to perform a 2D by 1D matrix multiply. Specifically:
import numpy as np
s = np.ones(268)
one_d = np.ones(9422700)
s_newaxis = s[:, np.newaxis]
goal = s_newaxis * one_d
While the dimensions above are the same as in my problem ((268, 1) and (9422700,)), the actual values in my arrays are a mix of very large and very small numbers. I can run goal = s_newaxis * one_d in the example above because it only contains 1s; with my actual data, however, I run out of RAM.
I recognize that, at the end of the day, this amounts to a matrix with roughly 2.5 billion values (268 × 9,422,700 ≈ 2.5 × 10^9, about 20 GB as float64), so a heavy memory footprint is to be expected. However, any improvement in terms of efficiency would be welcome.
For completeness, I've included a rough attempt. It is not very elegant, but it is just enough of an improvement that it won't crash my computer (admittedly a low bar).
import gc
def arange(start, stop, step):
    # `arange` which includes the endpoint (`stop`).
    arr = np.arange(start=start, stop=stop, step=step)
    if arr[-1] < stop:
        return np.append(arr, [stop])
    else:
        return arr
left, all_arrays = 0, list()
for right in arange(0, stop=s_newaxis.shape[0], step=10):
    chunk = s_newaxis[left:right, :] * one_d
    all_arrays.append(chunk)
    left = right
    gc.collect()  # unclear if this makes any difference...I suspect not.
goal = np.vstack(all_arrays)
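If the full product really must be materialized, one option (my suggestion, not from the question) is to stream the rows into a memory-mapped array on disk, so the result never has to fit in RAM; using float32 also halves the footprint if the precision is acceptable:
import numpy as np
s = np.ones(268)
one_d = np.ones(9422700)
# 'goal.npy' is a hypothetical output path
out = np.lib.format.open_memmap('goal.npy', mode='w+',
                                dtype=np.float32, shape=(len(s), len(one_d)))
for i in range(len(s)):
    out[i] = s[i] * one_d  # one row at a time, ~38 MB per row as float32
out.flush()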

Applying a mask for speeding up various array calculations

I have an np.ndarray with numbers that indicate spots of interest; I am interested in the spots which have values 1 and 9.
Right now they are being extracted as such:
# note: chaining the conditions with Python's `or` would ignore the second one; use elementwise |
maskindex.append(np.where((extract.variables['mask'][0] == 1) | (megadatalist[0].variables['mask'][0] == 9)))
xval = maskindex[0][1]
yval = maskindex[0][0]
I need to apply these x and y values to the arrays that I am operating on, to speed things up.
I have 140 arrays, each 734 x 1468, and I need the mean, max, min, and std calculated for each field. I was hoping there was an easy way to apply the masked array to speed up the operations; right now I am simply doing it on the entire arrays, as such:
Average_List = np.mean([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Average_Error_List = np.mean([megadatalist[i].variables['analysis_error'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Std_List = np.std([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Maximum_List = np.maximum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Minimum_List = np.minimum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Any ideas on how to speed things up would be highly appreciated.
I may have solved it partially, depending on what you're aiming for. The following code reduces an array arr to a 1D array containing only the values at the relevant indices. You can then do the needed calculations without considering the unwanted locations:
arr = np.array([[0,9,9,0,0,9,9,1],[9,0,1,9,0,0,0,1]])
target = [1,9] # wanted values
index = np.where(np.in1d(arr.ravel(), target).reshape(arr.shape))
no_zeros = arr[index]
At this stage, "all you need" is to reinsert the values no_zeros into an array of zeros of the appropriate shape, at the indices given in index. One way is to flatten the index array and recalculate the indices so that they match a flattened arr, and then use numpy.insert(np.zeros(arr.size), new_index, no_zeros), reshaping to the appropriate shape afterwards (reshaping is constant time in numpy). Admittedly, I have not figured out a fast numpy way to create the new_index array; see the direct-assignment sketch below for an alternative.
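For what it's worth, the reinsertion can also be done without any flattening, by assigning through the same 2D index tuple (a minimal sketch of mine, reusing the names from the answer):
import numpy as np
arr = np.array([[0, 9, 9, 0, 0, 9, 9, 1], [9, 0, 1, 9, 0, 0, 0, 1]])
target = [1, 9]
index = np.where(np.in1d(arr.ravel(), target).reshape(arr.shape))
no_zeros = arr[index]
restored = np.zeros(arr.shape, dtype=arr.dtype)
restored[index] = no_zeros  # fancy-index assignment puts the values back in place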
Hope it helps.
