Boolean indexing in xarray - Python

I have some arrays with dims 'time', 'lat', 'lon' and some with just 'lat', 'lon'. I often have to do this in order to mask time-dependent data with a 2d (lat-lon) mask:
x.data[:, mask.data] = np.nan
Of course, computations broadcast as expected. If y is 2d lat-lon data, its values are broadcast to all time coordinates in x:
z = x + y
But indexing doesn't broadcast as I'd expect. I'd like to be able to do this, but it raises ValueError: Buffer has wrong number of dimensions:
x[mask] = np.nan
Lastly, it seems that xr.where does broadcast the values of the mask across time coordinates as expected, but you can't set values this way.
x_masked = x.where(mask)
So, is there something I'm missing here that facilitates setting values using a boolean mask that is missing dimensions (and needs to be broadcast)? Is the option I provided at the top really the way to do this (in which case, I might as well just be using standard numpy arrays)?

Edit: this question is still getting upvotes, but it's now much easier - see this answer
Somewhat related question here: Concise way to filter data in xarray
Currently the best approach is a combination of .where and .fillna.
valid = date_by_items.notnull()
positive = date_by_items > 0
positive = positive * 2
result = positive.fillna(0.).where(valid)
result
But changes are coming in xarray that will make this more concise - check out the GitHub repo if you're interested.
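For reference, a minimal sketch of the broadcast-friendly masking the original question asks about, using .where and reassignment (the array names here are hypothetical stand-ins; .where keeps values where the condition is True and fills NaN elsewhere, broadcasting the 2D mask across time):
import numpy as np
import xarray as xr
x = xr.DataArray(np.random.rand(3, 4, 5), dims=("time", "lat", "lon"))
mask = xr.DataArray(np.random.rand(4, 5) > 0.5, dims=("lat", "lon"))
# NaN-out the points where mask is True, broadcast over the time dimension
x = x.where(~mask)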

Related

Is there a CuPy version supporting the axis option in the cupy.unique() function? Any workaround?

I'm looking for a GPU CuPy counterpart of numpy.unique() with axis option supported.
I have a 2D CuPy array from which I need to remove duplicated rows. Unfortunately, cupy.unique() flattens the array and returns a 1D array of unique values. I'm looking for a function like numpy.unique(arr, axis=0) to solve this, but CuPy does not support the axis option yet.
x = cp.array([[1,2,3,4], [4,5,6,7], [1,2,3,4], [10,11,12,13]])
y = cp.unique(x)
y_r = np.unique(cp.asnumpy(x), axis=0)
print('The 2D array:\n', x)
print('Required:\n', y_r, 'But using CuPy')
print('The flattened unique array:\n', y)
print('Error producing line:', cp.unique(x, axis=0))
I expect a 2D array with unique rows but I get a 1D array with unique numbers instead. Any ideas about how to implement this with CuPy or numba?
As of CuPy version 8.0.0b2, the function cupy.lexsort is correctly implemented. This function can be used as a workaround (albeit probably not the most efficient) for cupy.unique with the axis argument.
Assuming the array is 2D, and that you want to find the unique elements along axis 0 (else transpose/swap as appropriate):
###################################
# replacement for numpy.unique with option axis=0
###################################
import cupy

def cupy_unique_axis0(array):
    if len(array.shape) != 2:
        raise ValueError("Input array must be 2D.")
    # lexsort the rows so that duplicates become adjacent
    sortarr = array[cupy.lexsort(array.T[::-1])]
    # keep the first row and every row that differs from its predecessor
    mask = cupy.empty(array.shape[0], dtype=cupy.bool_)
    mask[0] = True
    mask[1:] = cupy.any(sortarr[1:] != sortarr[:-1], axis=1)
    return sortarr[mask]
Check the original cupy.unique source code (which this is based on) if you want to implement the return_* arguments (return_index, return_inverse, return_counts), too. I don't need those myself.
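A quick sanity check against the array from the question (assuming a CuPy-capable GPU is available):
import cupy
x = cupy.array([[1, 2, 3, 4], [4, 5, 6, 7], [1, 2, 3, 4], [10, 11, 12, 13]])
print(cupy_unique_axis0(x))
# rows come back lexsorted: [[1 2 3 4], [4 5 6 7], [10 11 12 13]]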

Taking mean of numpy ndarray with masked elements

I have a MxN array of values taken from an experiment. Some of these values are invalid and are set to 0 to indicate such. I can construct a mask of valid/invalid values using
mask = (mat1 == 0) & (mat2 == 0)
which produces an MxN array of bool. It should be noted that the masked locations do not neatly follow columns or rows of the matrix - so simply cropping the matrix is not an option.
Now, I want to take the mean along one axis of my array (e.g. end up with a 1xN array) while excluding those invalid values from the mean calculation. Intuitively I thought
np.mean(mat1[mask],axis=1)
should do it, but the mat1[mask] operation produces a 1D array which appears to just be the elements where mask is true - which doesn't help when I only want a mean across one dimension of the array.
Is there a 'python-esque' or numpy way to do this? I suppose I could use the mask to set masked elements to NaN and use np.nanmean - but that still feels kind of clunky. Is there a way to do this 'cleanly'?
I think the best way to do this would be something along the lines of:
masked = np.ma.masked_where((mat1 == 0) & (mat2 == 0), array_to_mask)
Then take the mean with
masked.mean(axis=1)
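A small self-contained sketch of that approach (hypothetical data, with zeros marking invalid entries in both matrices as in the question):
import numpy as np
mat1 = np.array([[1., 0., 3.], [4., 5., 0.]])
mat2 = np.array([[2., 0., 1.], [3., 2., 0.]])
# mask the entries that are zero in both matrices, then average the rest
masked = np.ma.masked_where((mat1 == 0) & (mat2 == 0), mat1)
print(masked.mean(axis=1))  # masked entries are excluded: [2.0 4.5]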
One similarly clunky but efficient way is to multiply your array with the mask, setting the masked values to zero. Then of course you'll have to divide by the number of non-masked values manually. Hence clunkiness. But this will work with integer-valued arrays, something that can't be said about the nan case. It also seems to be fastest for both small and larger arrays (including the masked array solution in another answer):
import numpy as np
def nanny(mat, mask):
    mat = mat.astype(float).copy()  # don't mutate the original
    mat[~mask] = np.nan             # blank out masked values
    return np.nanmean(mat, axis=0)  # compute mean, ignoring NaNs

def manual(mat, mask):
    # zero the masked values, then divide by the number of unmasked entries
    return (mat*mask).sum(axis=0)/mask.sum(axis=0)
# set up dummy data for testing
N,M = 400,400
mat1 = np.random.randint(0,N,(N,M))
mask = np.random.randint(0,2,(N,M)).astype(bool)
print(np.array_equal(nanny(mat1, mask), manual(mat1, mask))) # True
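A rough way to compare the two approaches on the dummy data above, using timeit (timings will of course vary by machine and array size):
import timeit
print(timeit.timeit(lambda: nanny(mat1, mask), number=100))
print(timeit.timeit(lambda: manual(mat1, mask), number=100))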

Obtaining indexes and creating an Python-based numpy array

I'm having issues translating MATLAB to Python code, especially when it involves matrices / arrays.
Here, I have a 2D numpy array called output and I am computing a vector of row-major indexes t_ind of the elements that are higher than a variable vmax:
t_ind = np.flatnonzero(output > vmax)
Now I'd like to use these indexes to create a matrix based on that. In MATLAB, I could do that directly:
output(t_ind) = 2*vmax - output(t_ind);
But in Python this does not work. Specifically, I get an IndexError stating that I'm out of bounds.
I tried to figure it out, but the most elegant solution I could think of involves using np.hstack() to transform the array into a vector, comparing the indexes, collecting them in another variable and converting back.
Could you shed some light on this?
For a 1D array, the use of np.flatnonzero is correct. Specifically, the equivalent numpy syntax would be:
output[t_ind] = 2*vmax - output[t_ind]
Also, you can achieve the same thing using Boolean operators. MATLAB also has this supported, and so if you want to translate between the two, Boolean (or logical in the MATLAB universe) is the better way to go:
output[output > vmax] = 2*vmax - output[output > vmax]
For the 2D case, you don't use np.flatnonzero. Use np.where instead:
t_ind = np.where(output > vmax)
output[t_ind] = 2*vmax - output[t_ind]
t_ind will return a tuple of numpy arrays where the first element gives you the row locations and the second element gives you the column locations of those values that satisfied the Boolean condition that is placed into np.where.
As a small note, Boolean indexing works for any number of dimensions. np.flatnonzero, however, computes row-major (linear) indices of the points that satisfy the condition. The reason you're getting an error is that you are using those linear indices to access a 2D array directly; unlike MATLAB, numpy does not let you index a multidimensional array with flat indices this way, so you have to index each dimension separately, which is exactly what passing t_ind (the tuple from np.where) as the index into output does.
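For completeness, a flat index from np.flatnonzero can still be used on a 2D array if you convert it with np.unravel_index first; a small sketch with hypothetical data:
import numpy as np
output = np.array([[1., 5., 2.], [7., 0., 9.]])
vmax = 4.0
t_ind = np.flatnonzero(output > vmax)         # row-major (flat) indices
r, c = np.unravel_index(t_ind, output.shape)  # convert to row/column indices
output[r, c] = 2*vmax - output[r, c]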
Numpy supports both boolean indexing and multi-dimensional indexing so you don't need to jump through all those hoops, here are two ways to get what you want:
# The setup
import numpy as np
a = np.random.random((3, 4))
vmax = 1.2
output = np.zeros(a.shape, a.dtype)
# Method one, use a boolean array to index
mask = a > .5
output[mask] = 2 * vmax - a[mask]
# Method two, use indices to index.
t_ind = np.nonzero(a > .5)
output[t_ind] = 2 * vmax - a[t_ind]

Applying a mask for speeding up various array calculations

I have an np.ndarray with numbers that indicate spots of interest; I am interested in the spots which have values 1 and 9.
Right now they are being extracted as such:
maskindex.append(np.where(extract.variables['mask'][0] == 1) or np.where(megadatalist[0].variables['mask'][0] == 9))
xval = maskindex[0][1]
yval = maskindex[0][0]
I need to apply these x and y values to the arrays that I am operating on, to speed things up.
I have 140 arrays that are each 734 x 1468, and I need the mean, max, min and std calculated for each field. I was hoping there was an easy way to apply the mask to speed up the operations; right now I am simply doing it on the entire arrays, as such:
Average_List = np.mean([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Average_Error_List = np.mean([megadatalist[i].variables['analysis_error'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Std_List = np.std([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Maximum_List = np.maximum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Minimum_List = np.minimum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Any ideas on how to speed things up would be highly appreciated.
I may have solved it partially, depending on what you're aiming for. The following code reduces an array arr to a 1D array with only the relevant indices. You can then do the needed calculations without considering the unwanted locations.
arr = np.array([[0,9,9,0,0,9,9,1],[9,0,1,9,0,0,0,1]])
target = [1,9] # wanted values
index = np.where(np.in1d(arr.ravel(), target).reshape(arr.shape))
no_zeros = arr[index]
At this stage "all you need" is to reinsert the values "no_zeros" on an array of zeroes with appropriate shape, on the indices given in "index". One way is to flatten the index array and recalculate the indices, so that they match a flattened arr array. Then use numpy.insert(np.zeroes(arr.shape),new_index,no_zeroes) and then reshaping to the appropriate shape afterwards. Reshaping is constant time in numpy. Admittedly, I have not figured out a fast numpy way to create the new_index array.
Hope it helps.
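A sketch of the reinsertion step described above: the index tuple returned by np.where can be reused directly for assignment, which avoids recomputing flat indices by hand:
import numpy as np
arr = np.array([[0, 9, 9, 0, 0, 9, 9, 1], [9, 0, 1, 9, 0, 0, 0, 1]])
target = [1, 9]
index = np.where(np.in1d(arr.ravel(), target).reshape(arr.shape))
no_zeros = arr[index]
# put the selected values back into a zero array of the original shape
rebuilt = np.zeros_like(arr)
rebuilt[index] = no_zeros
print(rebuilt)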

How to create the histogram of an array with masked values, in Numpy?

In Numpy 1.4.1, what is the simplest or most efficient way of calculating the histogram of a masked array? numpy.histogram and pyplot.hist do count the masked elements, by default!
The only simple solution I can think of right now involves creating a new array with the non-masked values:
histogram(m_arr[~m_arr.mask])
This is not very efficient, though, as this unnecessarily creates a new array. I'd be happy to read about better ideas!
I'm not sure whether or not the numpy developers would consider this a bug or expected behavior. I asked on the mailing list, so I guess we'll see what they say.
Either way, it's an easy fix. Patching numpy/lib/function_base.py to use numpy.asanyarray rather than numpy.asarray on the inputs to the function will allow it to properly use masked arrays (or any other subclass of an ndarray) without creating a copy.
Edit: It seems like it is expected behavior. As discussed here:
If you want to ignore masked data it's just one extra function call:
histogram(m_arr.compressed())
I don't think the fact that this makes an extra copy will be relevant, because I guess full masked array handling inside histogram will be a lot more expensive.
Using asanyarray would also allow matrices in, and other subtypes that might not be handled correctly by the histogram calculations.
For anything else besides dropping masked observations, it would be necessary to figure out what the masked array definition of a histogram is, as Bruce pointed out.
Try hist(m_arr.compressed()).
This is a super old question, but these days I just use:
numpy.histogram(m_arr, bins=.., range=.., density=False, weights=m_arr_mask)
Where m_arr_mask is an array with the same shape as m_arr, consisting of 0 values for elements of m_arr to be excluded from the histogram and 1 values for elements that are to be included.
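A small sketch of that weights trick (hypothetical values; the second element is excluded by the mask):
import numpy as np
m_arr = np.array([0.1, 0.4, 0.6, 0.9])
m_arr_mask = np.array([1, 0, 1, 1])  # 0 = exclude, 1 = include
hist, edges = np.histogram(m_arr, bins=2, range=(0, 1), weights=m_arr_mask)
print(hist)  # the masked-out 0.4 contributes nothing to its bin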
After running into casting issues by trying Erik's solution (see https://github.com/numpy/numpy/issues/16616), I decided to write a numba function to achieve this behavior.
Some of the code was inspired by https://numba.pydata.org/numba-examples/examples/density_estimation/histogram/results.html. I added the mask bit.
import numpy
import numba
@numba.jit(nopython=True)
def compute_bin(x, bin_edges):
    # assuming uniform bins for now
    n = bin_edges.shape[0] - 1
    a_min = bin_edges[0]
    a_max = bin_edges[-1]
    # special case to mirror NumPy behavior for the last bin
    if x == a_max:
        return n - 1  # a_max always falls in the last bin
    bin = int(n * (x - a_min) / (a_max - a_min))
    if bin < 0 or bin >= n:
        return None
    else:
        return bin
@numba.jit(nopython=True)
def masked_histogram(img, bin_edges, mask):
    hist = numpy.zeros(len(bin_edges) - 1, dtype=numpy.intp)
    for i, value in enumerate(img.flat):
        if mask.flat[i]:  # only count pixels where the mask is set
            bin = compute_bin(value, bin_edges)
            if bin is not None:
                hist[int(bin)] += 1
    return hist  # , bin_edges
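A quick usage sketch (hypothetical 1000 x 1000 image, boolean mask and 256 uniform bins):
import numpy
img = numpy.random.rand(1000, 1000)
mask = numpy.random.rand(1000, 1000) > 0.5  # True = include pixel
bin_edges = numpy.linspace(0.0, 1.0, 257)   # 256 uniform bins
hist = masked_histogram(img, bin_edges, mask)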
The speedup on a (1000, 1000) image is significant.
