How can we do the following operation in one line only in numpy?
medians = np.median(x, axis=0)
for i in range(0, len(x)): # transforming input data to binary values
for j in range(0, len(x[i])):
x[i][j] = 1 if x[i][j] <= medians[j] else 2
What it does is to transform this feature vector into binary values based on the values of the median for that dimension of data.
Use broadcasting:
x = (x <= np.median(x, axis=0))
The result will be a boolean array of zeros and ones. This wouldn't work, by the way, if you tried it with axis=1, because broadcasting matches axes from the right. Instead, you would have to insert a placeholder for the reduced axis, e.g. like this:
x = (x <= np.median(x, axis=1)[..., np.newaxis])
An even more general approach would be
x = (x <= np.median(x, axis=<whatever>, keepdims=True))
Since booleans are technically a subclass of integers in Python, and numpy honors that convention, you can get a mask with ones and twos instead of zeros and ones by adding one to the result, however you choose to compute it:
x = ... + 1
Related
I have one matrix and one vector, of dimensions (N, d) and (N,) respectively. For each row, I want to divide each element by the corresponding value in the vector. I was wondering if there was a vectorized implementation (to save computation time). (I'm trying to create points on the surface of a d-dimensional sphere.) Right now I'm doing this:
x = np.random.randn(N,d)
norm = np.linalg.norm(x, axis=1)
for i in range(N):
for j in range(d):
x[i][j] = x[i][j] / norm[i]
np.linalg.norm has a keepdims argument just for this:
x /= np.linalg.norm(x, axis=1, keepdims=True)
I have a 3 dimensional array of hape (365, x, y) where 36 corresponds to =daily data. In some cases, all the elements along the time axis axis=0 are np.nan.
The time series for each point along the axis=0 looks something like this:
I need to find the index at which the maximum value (peak data) occurs and then the two minimum values on each side of the peak.
import numpy as np
a = np.random.random(365, 3, 3) * 10
a[:, 0, 0] = np.nan
peak_mask = np.ma.masked_array(a, np.isnan(a))
peak_indexes = np.nanargmax(peak_mask, axis=0)
I can find the minimum before the peak using something like this:
early_minimum_indexes = np.full_like(peak_indexes, fill_value=0)
for i in range(peak_indexes.shape[0]):
for j in range(peak_indexes.shape[1]):
if peak_indexes[i, j] == 0:
early_minimum_indexes[i, j] = 0
else:
early_mask = np.ma.masked_array(a, np.isnan(a))
early_loc = np.nanargmin(early_mask[:peak_indexes[i, j], i, j], axis=0)
early_minimum_indexes[i, j] = early_loc
With the resulting peak and trough plotted like this:
This approach is very unreasonable time-wise for large arrays (1m+ elements). Is there a better way to do this using numpy?
While using masked arrays may not be the most efficient solution in this case, it will allow you to perform masked operations on specific axes while more-or-less preserving shape, which is a great convenience. Keep in mind that in many cases, the masked functions will still end up copying the masked data.
You have mostly the right idea in your current code, but you missed a couple of tricks, like being able to negate and combine masks. Also the fact that allocating masks as boolean up front is more efficient, and little nitpicks like np.full(..., 0) -> np.zeros(..., dtype=bool).
Let's work through this backwards. Let's say you had a well-behaved 1-D array with a peak, say a1. You can use masking to easily find the maxima and minima (or indices) like this:
peak_index = np.nanargmax(a1)
mask = np.zeros(a1.size, dtype=np.bool)
mask[peak:] = True
trough_plus = np.nanargmin(np.ma.array(a1, mask=~mask))
trough_minus = np.nanargmin(np.ma.array(a1, mask=mask))
This respects the fact that masked arrays flip the sense of the mask relative to normal numpy boolean indexing. It's also OK that the maximum value appears in the calculation of trough_plus, since it's guaranteed not to be a minimum (unless you have the all-nan situation).
Now if a1 was a masked array already (but still 1D), you could do the same thing, but combine the masks temporarily. For example:
a1 = np.ma.array(a1, mask=np.isnan(a1))
peak_index = a1.argmax()
mask = np.zeros(a1.size, dtype=np.bool)
mask[peak:] = True
trough_plus = np.ma.masked_array(a1, mask=a.mask | ~mask).argmin()
trough_minus (np.ma.masked_array(a1, mask=a.mask | mask).argmin()
Again, since masked arrays have reversed masks, it's important to combine the masks using | instead of &, as you would for normal numpy boolean masks. In this case, there is no need for calling the nan version of argmax and argmin, since all the nans are already masked out.
Hopefully, the generalization to multiple dimensions becomes clear from here, given the prevalence of the axis keyword in numpy functions:
a = np.ma.array(a, mask=np.isnan(a))
peak_indices = a.argmax(axis=0).reshape(1, *a.shape[1:])
mask = np.arange(a.shape[0]).reshape(-1, *(1,) * (a.ndim - 1)) >= peak_indices
trough_plus = np.ma.masked_array(a, mask=~mask | a.mask).argmin(axis=0)
trough_minus = np.ma.masked_array(a, mask=mask | a.mask).argmin(axis=0)
N-dimensional masking technique comes from Fill mask efficiently based on start indices, which was asked just for this purpose.
Here is a method that
copies the data
saves all nan positions and replaces all nans with global min-1
finds the rowwise argmax
subtracts its value from the entire row
note that each row now has only non-positive values with the max value now being zero
zeros all nan positions
flips the sign of all values right of the max
this is the main idea; it creates a new row-global max at the position where before there was the right hand min; at the same time it ensures that the left hand min is now row-global
retrieves the rowwise argmin and argmax, these are the postitions of the left and right mins in the original array
finds all-nan rows and overwrites the max and min indices at these positions with INVALINT
Code:
INVALINT = -9999
t,x,y = a.shape
t,x,y = np.ogrid[:t,:x,:y]
inval = np.isnan(a)
b = np.where(inval,np.nanmin(a)-1,a)
pk = b.argmax(axis=0)
pkval = b[pk,x,y]
b -= pkval
b[inval] = 0
b[t>pk[None]] *= -1
ltr = b.argmin(axis=0)
rtr = b.argmax(axis=0)
del b
inval = inval.all(axis=0)
pk[inval] = INVALINT
ltr[inval] = INVALINT
rtr[inval] = INVALINT
# result is now in ltr ("left trough"), pk ("peak") and rtr
I've an image of about 8000x9000 size as a numpy matrix. I also have a list of indices in a numpy 2xn matrix. These indices are fractional as well as may be out of image size. I need to interpolate the image and find the values for the given indices. If the indices fall outside, I need to return numpy.nan for them. Currently I'm doing it in for loop as below
def interpolate_image(image: numpy.ndarray, indices: numpy.ndarray) -> numpy.ndarray:
"""
:param image:
:param indices: 2xN matrix. 1st row is dim1 (rows) indices, 2nd row is dim2 (cols) indices
:return:
"""
# Todo: Vectorize this
M, N = image.shape
num_indices = indices.shape[1]
interpolated_image = numpy.zeros((1, num_indices))
for i in range(num_indices):
x, y = indices[:, i]
if (x < 0 or x > M - 1) or (y < 0 or y > N - 1):
interpolated_image[0, i] = numpy.nan
else:
# Todo: Do Bilinear Interpolation. For now nearest neighbor is implemented
interpolated_image[0, i] = image[int(round(x)), int(round(y))]
return interpolated_image
But the for loop is taking huge amount of time (as expected). How can I vectorize this? I found scipy.interpolate.interp2d, but I'm not able to use it. Can someone explain how to use this or any other method is also fine. I also found this, but again it is not according to my requirements. Given x and y indices, these generated interpolated matrices. I don't want that. For the given indices, I just want the interpolated values i.e. I need a vector output. Not a matrix.
I tried like this, but as said above, it gives a matrix output
f = interpolate.interp2d(numpy.arange(image.shape[0]), numpy.arange(image.shape[1]), image, kind='linear')
interp_image_vect = f(indices[:,0], indices[:,1])
RuntimeError: Cannot produce output of size 73156608x73156608 (size too large)
For now, I've implemented nearest-neighbor interpolation. scipy interp2d doesn't have nearest neighbor. It would be good if the library function as nearest neighbor (so I can compare). If not, then also fine.
It looks like scipy.interpolate.RectBivariateSpline will do the trick:
from scipy.interpolate import RectBivariateSpline
image = # as given
indices = # as given
spline = RectBivariateSpline(numpy.arange(M), numpy.arange(N), image)
interpolated = spline(indices[0], indices[1], grid=False)
This gets you the interpolated values, but it doesn't give you nan where you need it. You can get that with where:
nans = numpy.zeros(interpolated.shape) + numpy.nan
x_in_bounds = (0 <= indices[0]) & (indices[0] < M)
y_in_bounds = (0 <= indices[1]) & (indices[1] < N)
bounded = numpy.where(x_in_bounds & y_in_bounds, interpolated, nans)
I tested this with a 2624x2624 image and 100,000 points in indices and all told it took under a second.
Is there a numpy function to ensure a 1D- or 2D- array to be either a column or row vector?
For example, I have either one of the following vectors/lists. What is the easiest way to convert any of the input into a column vector?
x1 = np.array(range(5))
x2 = x1[np.newaxis, :]
x3 = x1[:, np.newaxis]
def ensureCol1D(x):
# The input is either a 0D list or 1D.
assert(len(x.shape)==1 or (len(x.shape)==2 and 1 in x.shape))
x = np.atleast_2d(x)
n = x.size
print(x.shape, n)
return x if x.shape[0] == n else x.T
assert(ensureCol1D(x1).shape == (x1.size, 1))
assert(ensureCol1D(x2).shape == (x2.size, 1))
assert(ensureCol1D(x3).shape == (x3.size, 1))
Instead of writing my own function ensureCol1D, is there something similar already available in numpy that ensures a vector to be column?
Your question is essentially how to convert an array into a "column", a column being a 2D array with a row length of 1. This can be done with ndarray.reshape(-1, 1).
This means that you reshape your array to have a row length of one, and let numpy infer the number of rows / column length.
x1 = np.array(range(5))
print(x1.reshape(-1, 1))
Output:
array([[0],
[1],
[2],
[3],
[4]])
You get the same output when reshaping x2 and x3. Additionally this also works for n-dimensional arrays:
x = np.random.rand(1, 2, 3)
print(x.reshape(-1, 1).shape)
Output:
(6, 1)
Finally the only thing missing here is that you make some assertions to ensure that arrays that cannot be converted are not converted incorrectly. The main check you're making is that the number of non-one integers in the shape is less than or equal to one. This can be done with:
assert sum(i != 1 for i in x1.shape) <= 1
This check along with .reshape let's you apply your logic on all numpy arrays.
I have a matrix of counts,
import numpy as np
x = np.array([[ 1,2,3],[1,4,6],[2,3,7]])
And I need the percentages of the total along axis = 1:
for i in range(x.shape[0]):
for j in range(x.shape[1]):
x[i,j] = x[i,j] / np.sum(x[i,:])
In numpy broadcast form.
Currently, I have:
x_sums = np.sum(x,axis=1)
for j in range(x.shape[1]):
x[:,j] = x[:,j] / x_sums[:]
Which puts most of the complexity in numpy code...but a numpy one liner would be best.
Also,
def percentages(a):
return a / np.sum(a)
x_percentages = np.apply_along_axis(percentages,1,x)
But that still involves python.
np.linalg.norm
Is very close, in terms of what is going on, but they only have the 8 hardcoded norms, which does not include percentage of total.
Then there is np.percentile, which is again close...but it is computing the sorted percentile.
x /= x.sum(axis=1, keepdims=True)
Altough x should have a floating point dtype for this to work correctly.
Better may be:
x = np.true_divide(x, x.sum(axis=1, keepdims=True))
Could this be what you are after:
print (x.T/np.sum(x, axis=1)).T