I have a 2D numpy array with shape (2, 50000), which means I have 50k samples of x,y values.
I want to filter the x,y values within a certain range, let's say:
min < x,y < max
I tried using np.apply_along_axis with a filter function, but I couldn't make it work.
I'd love to see some Pythonic way of executing this simple task!
With your array being arr and your values being (_min, _max), use:
selection = np.logical_and(_min <= arr, arr <= _max)   # elementwise range test, shape (2, 50000)
selection = np.logical_and(selection[0], selection[1]) # a sample passes only if both x and y are in range
filtered_arr = arr[:, selection]                       # keep only the passing columns
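As a quick check, here is the same recipe on a tiny made-up array (the values are hypothetical, just for illustration):

import numpy as np

arr = np.array([[1.0, 2.5, 4.0, 0.5],
                [3.0, 2.0, 5.0, 1.5]])  # row 0: x values, row 1: y values
_min, _max = 1.0, 4.0

selection = np.logical_and(_min <= arr, arr <= _max)
selection = np.logical_and(selection[0], selection[1])
print(arr[:, selection])  # [[1.  2.5]
                          #  [3.  2. ]]

Only the first two samples have both x and y inside [1.0, 4.0], so only those two columns survive.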
I'm basically trying to take the weighted mean of a 3D dataset, but only on a filtered subset of the data, where the filter is based on another (2D) array. The shape of the 2D data matches the first two dimensions of the 3D data, and is thus repeated for each slice in the 3rd dimension.
Something like:
import numpy as np
myarr = np.array([[[4,6,8],[9,3,2]],[[2,7,4],[3,8,6]],[[1,6,7],[7,8,3]]])
myarr2 = np.array([[7,3],[6,7],[2,6]])
weights = np.random.rand(3,2,3)
filtered = []
for k in range(len(myarr[0,0,:])):
    temp1 = myarr[:,:,k]
    temp2 = weights[:,:,k]
    filtered.append(temp1[np.where(myarr2 > 5)]*temp2[np.where(myarr2 > 5)])
average = np.array(np.sum(filtered,1)/len(filtered[0]))
I am concerned about efficiency here. Is it possible to vectorize this so I don't need the loop, or are there other suggestions to make this more efficient?
The most glaring efficiency issue, even leaving the loop aside, is that np.where(...) is being called multiple times inside the loop on the same condition! You can do this a single time beforehand. Moreover, there is no need for a loop at all. Your operation basically equates to:
mask = myarr2 > 5
average = (myarr[mask] * weights[mask]).mean(axis=0)
There is no need for an np.where either.
myarr2 is an array of shape (i, j) with the same first two dims as myarr and weights, which have some shape (i, j, k).
So if there are n True elements in the boolean mask myarr2 > 5, applying it to your other arrays yields (n, k) elements (taking all elements along the third axis wherever there is a True at a given [i, j] position).
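A quick sanity check with the question's own arrays (a sketch; the only assumption is that the random weights are shared between both versions):

import numpy as np

myarr = np.array([[[4,6,8],[9,3,2]],[[2,7,4],[3,8,6]],[[1,6,7],[7,8,3]]])
myarr2 = np.array([[7,3],[6,7],[2,6]])
weights = np.random.rand(3, 2, 3)

mask = myarr2 > 5
average = (myarr[mask] * weights[mask]).mean(axis=0)

# loop version from the question, for comparison
filtered = [myarr[:, :, k][mask] * weights[:, :, k][mask] for k in range(myarr.shape[2])]
loop_average = np.sum(filtered, 1) / len(filtered[0])
print(np.allclose(average, loop_average))  # True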
I have an image of about 8000x9000 pixels as a numpy array. I also have a list of indices in a 2xn numpy array. These indices are fractional and may also fall outside the image bounds. I need to interpolate the image and find the values at the given indices. If an index falls outside the image, I need to return numpy.nan for it. Currently I'm doing it in a for loop as below:
def interpolate_image(image: numpy.ndarray, indices: numpy.ndarray) -> numpy.ndarray:
    """
    :param image:
    :param indices: 2xN matrix. 1st row is dim1 (rows) indices, 2nd row is dim2 (cols) indices
    :return:
    """
    # Todo: Vectorize this
    M, N = image.shape
    num_indices = indices.shape[1]
    interpolated_image = numpy.zeros((1, num_indices))
    for i in range(num_indices):
        x, y = indices[:, i]
        if (x < 0 or x > M - 1) or (y < 0 or y > N - 1):
            interpolated_image[0, i] = numpy.nan
        else:
            # Todo: Do Bilinear Interpolation. For now nearest neighbor is implemented
            interpolated_image[0, i] = image[int(round(x)), int(round(y))]
    return interpolated_image
But the for loop is taking a huge amount of time (as expected). How can I vectorize this? I found scipy.interpolate.interp2d, but I'm not able to use it. Can someone explain how to use this? Any other method is also fine. I also found this, but again it is not according to my requirements. Given x and y indices, it generates interpolated matrices. I don't want that. For the given indices, I just want the interpolated values, i.e. I need a vector output, not a matrix.
I tried it like this, but as said above, it gives a matrix output:
f = interpolate.interp2d(numpy.arange(image.shape[0]), numpy.arange(image.shape[1]), image, kind='linear')
interp_image_vect = f(indices[:,0], indices[:,1])
RuntimeError: Cannot produce output of size 73156608x73156608 (size too large)
For now, I've implemented nearest-neighbor interpolation. scipy's interp2d doesn't have nearest neighbor. It would be good if the library function supported nearest neighbor (so I can compare). If not, that's also fine.
It looks like scipy.interpolate.RectBivariateSpline will do the trick:
from scipy.interpolate import RectBivariateSpline
image = # as given
indices = # as given
M, N = image.shape
spline = RectBivariateSpline(numpy.arange(M), numpy.arange(N), image)
interpolated = spline(indices[0], indices[1], grid=False)
This gets you the interpolated values, but it doesn't give you nan where you need it. You can get that with where:
nans = numpy.zeros(interpolated.shape) + numpy.nan
x_in_bounds = (0 <= indices[0]) & (indices[0] < M)
y_in_bounds = (0 <= indices[1]) & (indices[1] < N)
bounded = numpy.where(x_in_bounds & y_in_bounds, interpolated, nans)
I tested this with a 2624x2624 image and 100,000 points in indices and all told it took under a second.
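If you also want a library-based nearest-neighbor version to compare against your loop, one option (a sketch, not part of the original answer; it assumes a reasonably recent scipy) is scipy.interpolate.RegularGridInterpolator, which supports method='nearest' and handles the out-of-bounds nan for you via fill_value:

from scipy.interpolate import RegularGridInterpolator

M, N = image.shape
interp = RegularGridInterpolator(
    (numpy.arange(M), numpy.arange(N)), image,
    method='nearest', bounds_error=False, fill_value=numpy.nan)
nearest = interp(indices.T)  # expects points as a (num_points, 2) array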
I have a 3d array where all axis lengths are the same (for example (5,5,5)). I need to mask all of the array, keeping only certain slices unmasked, as per the code below. I managed to accomplish this using a for loop, but I wondered if there was a faster solution out there.
import numpy as np
import numpy.ma as ma

array = np.reshape(np.random.rand(125), (5, 5, 5))
array = ma.array(array, mask=True)
for i in range(array.shape[0]):
    for j in range(array.shape[1]):
        array[i, j, :].mask[i:j] = False
This allows me to sum this array with another array of the same size while ignoring the masked values.
You can create the entire mask in one step using broadcasting:
i, j, k = np.ogrid[:5, :5, :5]
mask = (i > k) | (k >= j)  # masked where k < i or k >= j, i.e. everywhere the loop did not unmask
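A quick check that this matches the loop-built mask (a sketch, assuming the loop from the question behaves as described):

import numpy as np
import numpy.ma as ma

array = ma.array(np.random.rand(5, 5, 5), mask=True)
for a in range(array.shape[0]):
    for b in range(array.shape[1]):
        array[a, b, :].mask[a:b] = False

i, j, k = np.ogrid[:5, :5, :5]
mask = (i > k) | (k >= j)
print(np.array_equal(array.mask, mask))  # True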
I'm having issues translating MATLAB code to Python, especially when it involves matrices / arrays.
Here, I have a 2D numpy array called output and I am computing a vector of row-major indexes t_ind of the elements that are higher than a variable vmax:
t_ind = np.flatnonzero(output > vmax)
Now I'd like to use these indexes to modify the matrix based on them. In MATLAB, I could do that directly:
output(t_ind) = 2*vmax - output(t_ind);
But in Python this does not work. Specifically, I get an IndexError stating that I'm out of bounds.
I tried to figure it out, but the most elegant solution I could think of involves using np.hstack() to flatten the array into a vector, compare the indexes, collect them in another variable, and then convert back.
Could you shed some light on this?
For a 1D array, the use of np.flatnonzero is correct. Specifically, the equivalent numpy syntax would be:
output[t_ind] = 2*vmax - output[t_ind]
Also, you can achieve the same thing using Boolean indexing. MATLAB supports this too, so if you want to translate between the two, Boolean (or logical, in the MATLAB universe) indexing is the better way to go:
output[output > vmax] = 2*vmax - output[output > vmax]
For the 2D case, you don't use np.flatnonzero. Use np.where instead:
t_ind = np.where(output > vmax)
output[t_ind] = 2*vmax - output[t_ind]
t_ind will return a tuple of numpy arrays where the first element gives you the row locations and the second element gives you the column locations of those values that satisfied the Boolean condition that is placed into np.where.
As a small note, Boolean indexing still applies to any number of dimensions. However, np.flatnonzero computes row-major (linear) indices of the points that satisfy the input condition. The reason you're getting an error is that you are trying to use those linear indices to access a 2D array; unlike MATLAB, numpy does not accept linear indices on a multi-dimensional array directly - you have to supply an index for each dimension, which is exactly what specifying t_ind (from np.where) as the input indexes into output does.
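If you do want to keep the flat indices from np.flatnonzero, one sketch is to convert them back to per-dimension indices with np.unravel_index before indexing (assuming output and vmax as defined in the question):

t_ind = np.flatnonzero(output > vmax)           # flat, row-major indices
rows, cols = np.unravel_index(t_ind, output.shape)
output[rows, cols] = 2*vmax - output[rows, cols]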
Numpy supports both boolean indexing and multi-dimensional indexing, so you don't need to jump through all those hoops. Here are two ways to get what you want:
# The setup
import numpy as np
a = np.random.random((3, 4))
vmax = 1.2
output = np.zeros(a.shape, a.dtype)
# Method one, use a boolean array to index
mask = a > .5
output[mask] = 2 * vmax - a[mask]
# Method two, use indices to index.
t_ind = np.nonzero(a > .5)
output[t_ind] = 2 * vmax - a[t_ind]
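Both methods fill exactly the same positions, which you can confirm with a small sanity check on the setup above:

out1 = np.zeros(a.shape, a.dtype)
out1[a > .5] = 2 * vmax - a[a > .5]

out2 = np.zeros(a.shape, a.dtype)
idx = np.nonzero(a > .5)
out2[idx] = 2 * vmax - a[idx]

print(np.array_equal(out1, out2))  # True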
I have an np.ndarray with numbers that indicate spots of interest; I am interested in the spots which have values 1 and 9.
Right now they are being extracted as such:
maskindex.append(np.where(extract.variables['mask'][0] == 1) or np.where(megadatalist[0].variables['mask'][0] == 9))
xval = maskindex[0][1]
yval = maskindex[0][0]
I need to apply these x and y values to the arrays that I am operating on, to speed things up.
I have 140 arrays that are each 734 x 1468, and I need the mean, max, min, and std calculated for each field. I was hoping there was an easy way to apply the masked array to speed up the operations; right now I am simply doing it on the entire arrays, as such:
Average_List = np.mean([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Average_Error_List = np.mean([megadatalist[i].variables['analysis_error'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Std_List = np.std([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Maximum_List = np.maximum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Minimum_List = np.minimum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Any ideas on how to speed things up would be highly appreciated.
I may have solved it partially, depending on what you're aiming for. The following code reduces an array arr to a 1d array with only the relevant indices. You can then do the needed calculations without considering the unwanted locations.
arr = np.array([[0,9,9,0,0,9,9,1],[9,0,1,9,0,0,0,1]])
target = [1,9] # wanted values
index = np.where(np.in1d(arr.ravel(), target).reshape(arr.shape))
no_zeros = arr[index]
At this stage "all you need" is to reinsert the values no_zeros into an array of zeros with the appropriate shape, at the indices given in index. One way is to flatten the index array and recalculate the indices so that they match a flattened arr array, then use numpy.insert(np.zeros(arr.shape), new_index, no_zeros) and reshape to the appropriate shape afterwards. Reshaping is constant time in numpy. Admittedly, I have not figured out a fast numpy way to create the new_index array.
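For what it's worth, here is one possible sketch of that reinsertion using plain flat-index assignment instead of numpy.insert; it relies on np.flatnonzero returning row-major indices that already line up with arr.ravel():

flat_index = np.flatnonzero(np.in1d(arr.ravel(), target))
result = np.zeros(arr.size, arr.dtype)  # zeros everywhere else
result[flat_index] = no_zeros           # reinsert the kept values
result = result.reshape(arr.shape)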
Hope it helps.