I have an image of about 8000x9000 pixels as a numpy matrix. I also have a list of indices in a 2xN numpy matrix. These indices are fractional and may also fall outside the image. I need to interpolate the image and find the values at the given indices. If an index falls outside the image, I need to return numpy.nan for it. Currently I'm doing this in a for loop, as below:
def interpolate_image(image: numpy.ndarray, indices: numpy.ndarray) -> numpy.ndarray:
    """
    :param image:
    :param indices: 2xN matrix. 1st row is dim1 (rows) indices, 2nd row is dim2 (cols) indices
    :return:
    """
    # Todo: Vectorize this
    M, N = image.shape
    num_indices = indices.shape[1]
    interpolated_image = numpy.zeros((1, num_indices))
    for i in range(num_indices):
        x, y = indices[:, i]
        if (x < 0 or x > M - 1) or (y < 0 or y > N - 1):
            interpolated_image[0, i] = numpy.nan
        else:
            # Todo: Do Bilinear Interpolation. For now nearest neighbor is implemented
            interpolated_image[0, i] = image[int(round(x)), int(round(y))]
    return interpolated_image
But the for loop takes a huge amount of time (as expected). How can I vectorize this? I found scipy.interpolate.interp2d, but I'm not able to use it. Can someone explain how to use it? Any other method is fine too. I also found this, but again it does not meet my requirements: given x and y indices, it generates an interpolated matrix. I don't want that. For the given indices, I just want the interpolated values, i.e. I need a vector output, not a matrix.
I tried like this, but as said above, it gives a matrix output
f = interpolate.interp2d(numpy.arange(image.shape[0]), numpy.arange(image.shape[1]), image, kind='linear')
interp_image_vect = f(indices[:,0], indices[:,1])
RuntimeError: Cannot produce output of size 73156608x73156608 (size too large)
For now, I've implemented nearest-neighbor interpolation. scipy's interp2d doesn't have nearest neighbor. It would be good if the library function had nearest neighbor (so I can compare). If not, that's fine too.
It looks like scipy.interpolate.RectBivariateSpline will do the trick:
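import numpy
from scipy.interpolate import RectBivariateSpline
# image (2-D array) and indices (2xN array) are as given in the question
M, N = image.shape
# The default is a cubic spline (kx=3, ky=3); pass kx=1, ky=1 for bilinear interpolation
spline = RectBivariateSpline(numpy.arange(M), numpy.arange(N), image)
interpolated = spline(indices[0], indices[1], grid=False)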
This gets you the interpolated values, but it doesn't give you nan where you need it. You can get that with where:
nans = numpy.full(interpolated.shape, numpy.nan)
# Match the question's bounds: anything beyond the last row/column index is outside
x_in_bounds = (0 <= indices[0]) & (indices[0] <= M - 1)
y_in_bounds = (0 <= indices[1]) & (indices[1] <= N - 1)
bounded = numpy.where(x_in_bounds & y_in_bounds, interpolated, nans)
I tested this with a 2624x2624 image and 100,000 points in indices and all told it took under a second.
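If a spline fit is more than is needed, the bilinear case asked about in the question can also be written directly in NumPy. The following is a minimal sketch rather than the method used above: the function name interpolate_bilinear is hypothetical, and it assumes a 2-D image with at least two rows and two columns and a 2xN array of (row, col) positions.

```python
import numpy as np

def interpolate_bilinear(image, indices):
    """Bilinear interpolation at fractional (row, col) positions; NaN outside the image."""
    M, N = image.shape  # assumes M >= 2 and N >= 2
    r = np.asarray(indices[0], dtype=float)
    c = np.asarray(indices[1], dtype=float)
    valid = (r >= 0) & (r <= M - 1) & (c >= 0) & (c <= N - 1)

    # Replace invalid positions by 0 so the indexing below is always legal;
    # their results are overwritten with NaN at the end.
    rs = np.where(valid, r, 0.0)
    cs = np.where(valid, c, 0.0)

    # Top-left corner of the enclosing cell, capped so that +1 stays inside the image.
    r0 = np.minimum(np.floor(rs).astype(int), M - 2)
    c0 = np.minimum(np.floor(cs).astype(int), N - 2)
    fr = rs - r0  # fractional offsets in [0, 1]
    fc = cs - c0

    top = image[r0, c0] * (1 - fc) + image[r0, c0 + 1] * fc
    bottom = image[r0 + 1, c0] * (1 - fc) + image[r0 + 1, c0 + 1] * fc
    return np.where(valid, top * (1 - fr) + bottom * fr, np.nan)

# Small demo: the value at (row, col) is 4*row + col, so bilinear results are exact.
demo_image = np.arange(12.0).reshape(3, 4)
demo_indices = np.array([[0.0, 0.5, 2.0, -1.0],
                         [0.0, 0.5, 3.0, 0.0]])
print(interpolate_bilinear(demo_image, demo_indices))  # 0, 2.5, 11, then nan
```

The result is a 1-D vector with one value per column of indices, which is the output shape the question asks for.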
I'm basically trying to take the weighted mean of a 3D dataset, but only on a filtered subset of the data, where the filter is based off of another (2D) array. The shape of the 2D data matches the first 2 dimensions of the 3D data, and is thus repeated for each slice in the 3rd dimension.
Something like:
import numpy as np
myarr = np.array([[[4,6,8],[9,3,2]],[[2,7,4],[3,8,6]],[[1,6,7],[7,8,3]]])
myarr2 = np.array([[7,3],[6,7],[2,6]])
weights = np.random.rand(3,2,3)
filtered = []
for k in range(len(myarr[0,0,:])):
    temp1 = myarr[:,:,k]
    temp2 = weights[:,:,k]
    filtered.append(temp1[np.where(myarr2 > 5)]*temp2[np.where(myarr2 > 5)])
average = np.array(np.sum(filtered,1)/len(filtered[0]))
I am concerned about efficiency here. Is it possible to vectorize this so I don't need the loop, or are there other suggestions to make this more efficient?
The most glaring efficiency issue, even setting the loop aside, is that np.where(...) is called multiple times inside the loop on the same condition. You can do this once beforehand. Moreover, there is no need for a loop. Your operation is equivalent to:
mask = myarr2 > 5
average = (myarr[mask] * weights[mask]).mean(axis=0)
There is no need for an np.where either.
myarr2 is an array of shape (i, j) with the same first two dimensions as myarr and weights, which have shape (i, j, k).
So if there are n True elements in the boolean mask myarr2 > 5, applying it to the other arrays yields an array of shape (n, k): all elements along the third axis are kept wherever there is a True at a given [i, j] position.
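The shape argument above can be checked on the arrays from the question. This is a small sketch; the seeded generator rng is an addition made only so that the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded only for repeatability
myarr = np.array([[[4, 6, 8], [9, 3, 2]],
                  [[2, 7, 4], [3, 8, 6]],
                  [[1, 6, 7], [7, 8, 3]]])
myarr2 = np.array([[7, 3], [6, 7], [2, 6]])
weights = rng.random((3, 2, 3))

mask = myarr2 > 5        # shape (3, 2) with four True entries
selected = myarr[mask]   # shape (4, 3): one row per True entry, all k values kept

vectorized = (myarr[mask] * weights[mask]).mean(axis=0)

# Loop version from the question, rewritten compactly for comparison
looped = np.array([
    (myarr[:, :, k][mask] * weights[:, :, k][mask]).mean()
    for k in range(myarr.shape[2])
])

print(selected.shape)                   # (4, 3)
print(np.allclose(vectorized, looped))  # True
```

Note that this reproduces the question's computation, which is the plain mean of the products. If a mean normalized by the sum of the weights is what is actually wanted (an assumption about intent), np.average(myarr[mask], weights=weights[mask], axis=0) may be closer.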
I have a 3-dimensional array of shape (365, x, y), where 365 corresponds to daily data. In some cases, all the elements along the time axis (axis=0) are np.nan.
The time series for each point along the axis=0 looks something like this:
I need to find the index at which the maximum value (peak data) occurs and then the two minimum values on each side of the peak.
import numpy as np
a = np.random.random((365, 3, 3)) * 10
a[:, 0, 0] = np.nan
peak_mask = np.ma.masked_array(a, np.isnan(a))
peak_indexes = np.nanargmax(peak_mask, axis=0)
I can find the minimum before the peak using something like this:
early_minimum_indexes = np.full_like(peak_indexes, fill_value=0)
for i in range(peak_indexes.shape[0]):
    for j in range(peak_indexes.shape[1]):
        if peak_indexes[i, j] == 0:
            early_minimum_indexes[i, j] = 0
        else:
            early_mask = np.ma.masked_array(a, np.isnan(a))
            early_loc = np.nanargmin(early_mask[:peak_indexes[i, j], i, j], axis=0)
            early_minimum_indexes[i, j] = early_loc
With the resulting peak and trough plotted like this:
This approach is very unreasonable time-wise for large arrays (1m+ elements). Is there a better way to do this using numpy?
While using masked arrays may not be the most efficient solution in this case, it will allow you to perform masked operations on specific axes while more-or-less preserving shape, which is a great convenience. Keep in mind that in many cases, the masked functions will still end up copying the masked data.
You have mostly the right idea in your current code, but you missed a couple of tricks, such as being able to negate and combine masks. It is also more efficient to allocate masks as boolean up front, and there are small nitpicks like np.full(..., 0) -> np.zeros(..., dtype=bool).
Let's work through this backwards. Let's say you had a well-behaved 1-D array with a peak, say a1. You can use masking to easily find the maxima and minima (or indices) like this:
peak_index = np.nanargmax(a1)
mask = np.zeros(a1.size, dtype=bool)
mask[peak_index:] = True
trough_plus = np.nanargmin(np.ma.array(a1, mask=~mask))
trough_minus = np.nanargmin(np.ma.array(a1, mask=mask))
This respects the fact that masked arrays flip the sense of the mask relative to normal numpy boolean indexing. It's also OK that the maximum value appears in the calculation of trough_plus, since it's guaranteed not to be a minimum (unless you have the all-nan situation).
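A tiny worked example may make the mask orientation concrete. This is only a sketch: the sample values are made up, and it uses the masked array's own argmin method instead of np.nanargmin, since the sample contains no NaNs.

```python
import numpy as np

a1 = np.array([3.0, 1.0, 4.0, 9.0, 2.0, 0.5, 6.0])  # made-up data, peak at index 3

peak_index = np.argmax(a1)
mask = np.zeros(a1.size, dtype=bool)
mask[peak_index:] = True   # True from the peak onwards

# In a masked array, True means "ignore this element".
trough_plus = np.ma.array(a1, mask=~mask).argmin()   # ignores everything before the peak
trough_minus = np.ma.array(a1, mask=mask).argmin()   # ignores the peak and everything after

print(peak_index, trough_minus, trough_plus)  # 3 1 5
```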
Now if a1 was a masked array already (but still 1D), you could do the same thing, but combine the masks temporarily. For example:
a1 = np.ma.array(a1, mask=np.isnan(a1))
peak_index = a1.argmax()
mask = np.zeros(a1.size, dtype=bool)
mask[peak_index:] = True
trough_plus = np.ma.masked_array(a1, mask=a1.mask | ~mask).argmin()
trough_minus = np.ma.masked_array(a1, mask=a1.mask | mask).argmin()
Again, since masked arrays have reversed masks, it's important to combine the masks using | instead of &, as you would for normal numpy boolean masks. In this case, there is no need for calling the nan version of argmax and argmin, since all the nans are already masked out.
Hopefully, the generalization to multiple dimensions becomes clear from here, given the prevalence of the axis keyword in numpy functions:
a = np.ma.array(a, mask=np.isnan(a))
peak_indices = a.argmax(axis=0).reshape(1, *a.shape[1:])
mask = np.arange(a.shape[0]).reshape(-1, *(1,) * (a.ndim - 1)) >= peak_indices
trough_plus = np.ma.masked_array(a, mask=~mask | a.mask).argmin(axis=0)
trough_minus = np.ma.masked_array(a, mask=mask | a.mask).argmin(axis=0)
The N-dimensional masking technique comes from the question "Fill mask efficiently based on start indices", which was asked just for this purpose.
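To gain some confidence in the N-dimensional version, its result can be compared against a direct per-series computation. This is a sketch under assumptions: the random test data, the seeded generator and the names data and after_peak are illustrative and not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumption: seeded test data
data = rng.random((365, 3, 3)) * 10
data[:, 0, 0] = np.nan           # one all-NaN time series, as in the question

a = np.ma.array(data, mask=np.isnan(data))
peak_indices = a.argmax(axis=0)[None]   # shape (1, 3, 3)
after_peak = np.arange(a.shape[0])[:, None, None] >= peak_indices
trough_plus = np.ma.array(data, mask=~after_peak | a.mask).argmin(axis=0)
trough_minus = np.ma.array(data, mask=after_peak | a.mask).argmin(axis=0)

# Brute-force check for a single point (i, j) = (1, 2)
series = data[:, 1, 2]
p = int(np.argmax(series))
print(peak_indices[0, 1, 2] == p)
print(trough_plus[1, 2] == p + np.argmin(series[p:]))
if p > 0:
    print(trough_minus[1, 2] == np.argmin(series[:p]))
```

The all-NaN series at (0, 0) is fully masked, so its indices carry no meaning and would need separate handling.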
Here is a method that:
1. copies the data
2. saves all nan positions and replaces all nans with the global min - 1
3. finds the rowwise argmax
4. subtracts its value from the entire row (note that each row now has only non-positive values, with the max value now being zero)
5. zeros all nan positions
6. flips the sign of all values right of the max (this is the main idea: it creates a new row-global max at the position where before there was the right-hand min; at the same time it ensures that the left-hand min is now row-global)
7. retrieves the rowwise argmin and argmax; these are the positions of the left and right mins in the original array
8. finds all-nan rows and overwrites the max and min indices at these positions with INVALINT
Code:
INVALINT = -9999
t,x,y = a.shape
t,x,y = np.ogrid[:t,:x,:y]
inval = np.isnan(a)
b = np.where(inval,np.nanmin(a)-1,a)
pk = b.argmax(axis=0)
pkval = b[pk,x,y]
b -= pkval
b[inval] = 0
b[t>pk[None]] *= -1
ltr = b.argmin(axis=0)
rtr = b.argmax(axis=0)
del b
inval = inval.all(axis=0)
pk[inval] = INVALINT
ltr[inval] = INVALINT
rtr[inval] = INVALINT
# result is now in ltr ("left trough"), pk ("peak") and rtr ("right trough")
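The sign-flip idea is easiest to see on a single 1-D row without NaNs. The following sketch uses made-up values and is only meant to illustrate the main step, not to replace the code above.

```python
import numpy as np

row = np.array([5.0, 2.0, 7.0, 10.0, 4.0, 1.0, 6.0])  # made-up data, peak at index 3

pk = row.argmax()
b = row - row[pk]    # [-5, -8, -3, 0, -6, -9, -4]: everything is <= 0
b[pk + 1:] *= -1     # [-5, -8, -3, 0,  6,  9,  4]: right of the peak is now >= 0

left_trough = b.argmin()    # the smallest value left of the peak is still the minimum
right_trough = b.argmax()   # the smallest value right of the peak became the maximum

print(pk, left_trough, right_trough)  # 3 1 5
```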
I have a huge matrix (think 20000 x 1000) called Z that I need to generate the pairwise distance from so I'm currently using sklearn.metrics.pairwise.euclidean_distances(Z,Z) to generate the pairwise distances.
However, now I need to search through the result to find the smallest X distances but I need their indices.
An example would be:
A = 20000 x 1000 numpy.ndarray
B = sklearn.metrics.pairwise.euclidean_distances(A, A)
C = ((2400,100), (800,900), (29,999)) if X = 3
What would be the best way to go about doing this? I saw numpy.unravel_index(a.argmax(), a.shape) but I'm not sure it would work well for this instance.
You can use np.triu_indices to generate the indices that correspond to entries of the compressed distance matrix.
import numpy as np
from scipy.spatial.distance import pdist
# Generate points
Z = np.random.normal(0, 1, (1000, 3))
# Compute euclidean distance
distance = pdist(Z)
# Get the smallest distance
min_distance = np.min(distance)
# Get the indices (k = 1 to omit diagonal entries)
idx = np.asarray(np.triu_indices(len(Z), 1))
# Filter the indices (this also handles the case where the minimum distance is not unique)
idx = idx[:, distance == min_distance]
If you know that there is exactly one minimum distance, you could also use
idx = idx[:, np.argmin(distance)]
which is slightly more efficient.
EDIT: To get the index pairs sorted by increasing distance, try the following:
idx = idx[:, np.argsort(distance)]
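Since the question asks only for the X smallest distances, a full sort may be more work than needed. A possible sketch uses np.argpartition to pick the X smallest entries and then sorts just those; the small random Z, the seeded generator and X = 3 are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)   # assumption: small random data for illustration
Z = rng.normal(size=(200, 3))
X = 3                            # number of closest pairs wanted

distance = pdist(Z)              # condensed distances, one per pair (i, j) with i < j
rows, cols = np.triu_indices(len(Z), 1)

# argpartition places the X smallest distances in the first X slots (unordered) ...
smallest = np.argpartition(distance, X - 1)[:X]
# ... so only those X entries need to be sorted.
smallest = smallest[np.argsort(distance[smallest])]

pairs = np.column_stack((rows[smallest], cols[smallest]))
print(pairs)               # one (i, j) pair per row, closest pair first
print(distance[smallest])  # the matching distances in increasing order
```

For 20000 points the condensed array holds 20000 * 19999 / 2 distances, roughly 1.6 GB as float64, which is about half of what the full square matrix would need.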
I have a meshgrid in numpy. I make some calculations on the points. I want to filter out points that could not be calculated for some reason (e.g. division by zero).
import numpy
from numpy import arange
Xout = arange(-400, 400, 20)
Yout = arange(0, 400, 20)
Zout = arange(0, 400, 20)
Xout_3d, Yout_3d, Zout_3d = numpy.meshgrid(Xout,Yout,Zout)
#some calculations
# for example
b = z / ( y - x )
To compute z / ( y - x ) on those 3D mesh arrays, you can create a mask of the valid entries, i.e. those where y and x are not equal. This mask has shape (M,N), where M and N are the lengths of the Y and X axes respectively. To make the mask cover all combinations of X and Y, we can use NumPy's broadcasting. Thus, we would have such a mask like so -
mask = Yout[:,None] != Xout
Finally, again using broadcasting to align the mask with the first two axes of the 3D arrays, we can perform the division and choose between an invalid specifier and the actual division result using np.where, like so -
invalid_spec = 0
out = np.where(mask[...,None],Zout_3d/(Yout_3d-Xout_3d),invalid_spec)
Alternatively, we can directly get to such an output using broadcasting and thus avoid using meshgrid and having those heavy 3D arrays in workspace. The idea is to simultaneously populate the 3D grids and perform the subtraction and division computations, both on the fly. So, the implementation would look something like this -
np.where(mask[...,None],Zout/(Yout[:,None,None] - Xout[:,None]),invalid_spec)
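The two variants can be compared directly on the grids from the question. This is a sketch with two assumptions that are not in the answer above: np.nan is used as the invalid marker instead of 0, and np.errstate is used to silence the divide-by-zero warnings that the masked-out entries would otherwise trigger.

```python
import numpy as np

Xout = np.arange(-400, 400, 20)   # 40 values
Yout = np.arange(0, 400, 20)      # 20 values
Zout = np.arange(0, 400, 20)      # 20 values

mask = Yout[:, None] != Xout      # shape (20, 40): True where y - x is nonzero
invalid_spec = np.nan             # assumption: NaN marks the invalid entries

with np.errstate(divide="ignore", invalid="ignore"):
    # Variant 1: explicit 3D grids from meshgrid
    Xout_3d, Yout_3d, Zout_3d = np.meshgrid(Xout, Yout, Zout)
    out_mesh = np.where(mask[..., None], Zout_3d / (Yout_3d - Xout_3d), invalid_spec)

    # Variant 2: pure broadcasting, no 3D input grids
    out_bcast = np.where(mask[..., None],
                         Zout / (Yout[:, None, None] - Xout[:, None]),
                         invalid_spec)

print(out_bcast.shape)                                 # (20, 40, 20)
print(np.allclose(out_mesh, out_bcast, equal_nan=True))  # True
```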
I'm working with a triangulated mesh consisting of points 3 x n and triangles specified by the point indices 3 x m. I can easily plot that e.g. using mlab
mesh = mlab.triangular_mesh(p[0,:],p[1,:],p[2,:],t.T)
I am also generating a mask masking points which are out of bounds or nan, so I have a mask the size of n. Now I want to mask the triangles which have a masked point. My solutions so far:
1: Use the mask to turn all masked points into nan, e.g.
p[:, mask] = np.nan
mlab then still shows nan (I would need to include a threshold filter...) and I actually don't want to mess with my data
2: Generating a triangle mask, which I started like this
def triangleMask(triangles, pointmask):
    maskedTris = np.zeros((triangles.shape[1]), dtype=bool)
    maskedIdx = np.nonzero(pointmask)[0]
    for i,t in enumerate(triangles.T):
        if (i%5000) == 0:
            print('working it.:', i)
        for p in t:
            if p in maskedIdx:
                maskedTris[i] = True
                break
    return maskedTris
This works, but is not fast. And in my case, n = 250.000 and m = 500.000, so "not fast" is quite a problem.
I know there's a mask keyword in mlab, but I can't get it to work. Masking only the points in the triangular_mesh call yields an error, since t then refers to indices which are larger than the size of p.
So you have a points array of shape (3, n), a triangles array of shape (3, m) and a point_mask boolean array of shape (n,), and would like to create a triangle_mask of shape (m,) holding True at position j if any of the indices in triangles[:, j] corresponds to a True in point_mask. You can do that with a little bit of fancy indexing:
triangle_mask = np.any(point_mask[triangles], axis=0)
To understand what's going on, point_mask[triangles] creates a boolean array of shape (3, m), with the value at position (i, j) being point_mask[triangles[i, j]].
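A five-point, three-triangle toy mesh shows the indexing at work. The values are made up for illustration, and dropping the masked triangles with ~triangle_mask is one possible follow-up step, not something the answer prescribes.

```python
import numpy as np

# Column j holds the three point indices of triangle j: (0,1,2), (1,2,3), (2,3,4)
triangles = np.array([[0, 1, 2],
                      [1, 2, 3],
                      [2, 3, 4]])
point_mask = np.array([False, False, False, False, True])  # only point 4 is masked

per_corner = point_mask[triangles]          # shape (3, m): mask value of every corner
triangle_mask = np.any(per_corner, axis=0)  # True if any corner of the triangle is masked

kept = triangles[:, ~triangle_mask]         # triangles that use no masked point

print(triangle_mask)  # [False False  True]
print(kept.shape)     # (3, 2)
```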