Index n-th dimension of Numpy array with 2d index array - python

I have the following 2-D Numpy arrays:
X # X.shape = (11688, 144)
Y # Y.shape = (2912, 1000)
The first array is populated with atmospheric data, and the second array is populated with random index values from 0 to X.shape[0]-1. I want to index the rows of X with each column of Y to yield a 3-D array result, where result.shape = (2912, 1000, 144), and I want to do this without looping.
My current approach is:
result = X[Y,:]
but this one line of code can take more than 10 seconds to execute, depending on the size of the 0th axis of Y.
Is there a more optimal way to perform this type of indexing in order to speed up its execution?
EDIT: Here's a more complete example of what I'm trying to accomplish.
import numpy as np

X = np.random.rand(11688, 144)  # Time-by-longitude array of atmospheric data
t = np.arange(X.shape[0])       # Time vector

# Populate an array of randomly drawn time steps
Y = np.zeros((2912, 1000), dtype='i')
for i in range(1000):
    Y[:, i] = np.random.choice(t, 2912)

# Index X with each column of Y
result = X[Y, :]
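For reference, one equivalent formulation that may be worth benchmarking is np.take, which performs the same gather along axis 0 and is sometimes faster than fancy indexing for large index arrays (a sketch, not a guaranteed speedup):
import numpy as np

X = np.random.rand(11688, 144)
Y = np.random.randint(0, X.shape[0], size=(2912, 1000))

# np.take(X, Y, axis=0) gathers a row of X for every entry of Y,
# producing the same (2912, 1000, 144) result as X[Y, :]
result = np.take(X, Y, axis=0)
assert result.shape == (2912, 1000, 144)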

Related

Random sample from specific rows and columns of a 2d numpy array (essentially sampling by ignoring edge effects)

I have a 2d numpy array of size 100 x 100.
I want to randomly sample values from the "inside" 80 x 80 values so that I can exclude values which are influenced by edge effects. I want to sample from row 10 to row 90 and within that from column 10 to column 90.
However, importantly, I need to retain the original index values from the 100 x 100 grid, so I can't just trim the dataset and move on. If I do that, I am not really solving the edge effect problem because this is occurring within a loop with multiple iterations.
gridsize = 100
new_abundances = np.zeros([100, 100], dtype=np.uint8)
min_select = int(np.around(gridsize * 0.10))
max_select = int(gridsize - np.around(gridsize * 0.10))
row_idx = np.arange(min_select, max_select)
col_idx = np.arange(min_select, max_select)
indices_random = ?????  # Somehow randomly sample from new_abundances only within the rows and columns of row_idx and col_idx
What I ultimately need is a list of 250 random indices selected from within the flattened new_abundances array. I need to keep the new_abundances array as 2d to identify the "edges" but once that is done, I need to flatten it to get the indices which are randomly selected.
Desired output:
A 1d list of indices into the flattened new_abundances array.
Would something like this solve your problem?
import numpy as np
np.random.seed(0)
mat = np.random.random(size=(100,100))
x_indices = np.random.randint(low=10, high=90, size=250)
y_indices = np.random.randint(low=10, high=90, size=250)
coordinates = list(zip(x_indices,y_indices))
flat_mat = mat.flatten()
flat_index = x_indices * 100 + y_indices
Then you can access elements using any value from the coordinates list, e.g. mat[coordinates[0]] returns the matrix value at coordinates[0]. The value of coordinates[0] is (38, 45) in my case. If the matrix is flattened, you can calculate the 1D index of the corresponding element: in this case, mat[coordinates[0]] == flat_mat[flat_index[0]] holds, where flat_index[0] == 3845 == 100*38 + 45.
Note also that this samples with replacement, so the same cell of the original data can be drawn more than once.
Using your notation:
import numpy as np
np.random.seed(0)
gridsize = 100
new_abundances = np.zeros([100,100],dtype=np.uint8)
min_select = int(np.around(gridsize * 0.10))
max_select = int(gridsize - (np.around(gridsize * 0.10)))
x_indices = np.random.randint(low=min_select, high=max_select, size=250)
y_indices = np.random.randint(low=min_select, high=max_select, size=250)
coords = list(zip(x_indices,y_indices))
flat_new_abundances = new_abundances.flatten()
flat_index = x_indices * gridsize + y_indices
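The sampled values can then be read either from the 2-D array or from its flattened copy; both routes return the same 250 values (a small usage sketch reusing the names above):
# Equivalent reads: 2-D fancy indexing vs. indexing the flattened copy
vals_2d = new_abundances[x_indices, y_indices]
vals_flat = flat_new_abundances[flat_index]
assert (vals_2d == vals_flat).all()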

Average of a 3D numpy slice based on 2D arrays

I am trying to calculate the average of a 3D array between two indices along the first axis (axis 0). The start and end indices vary from cell to cell and are represented by two separate 2D arrays that are the same shape as a slice of the 3D array.
I have managed to implement a piece of code that loops through the pixels of my 3D array, but this method is painfully slow in the case of my array with a shape of (70, 550, 350). Is there a way to vectorise the operation using numpy or xarray (the arrays are stored in an xarray dataset)?
Here is a snippet of what I would like to optimise:
import numpy as np

# My 3D raster containing values; shape = (time, x, y)
values = np.random.rand(10, 55, 60)
# A 2D raster containing start indices for the averaging
start_index = np.random.randint(0, 4, size=(values.shape[1], values.shape[2]))
# A 2D raster containing end indices for the averaging
end_index = np.random.randint(5, 9, size=(values.shape[1], values.shape[2]))
# Initialise an array that will contain the results
mean_array = np.zeros_like(values[0, :, :])
# Loop over the 3D raster to calculate the average between indices on axis 0
for i in range(0, values.shape[1]):
    for j in range(0, values.shape[2]):
        mean_array[i, j] = np.mean(values[start_index[i, j]:end_index[i, j], i, j], axis=0)
One way to do this without loops is to zero out the entries you don't want to use, compute the sum of the remaining items, then divide by the number of entries in each index range. For example:
i = np.arange(values.shape[0])[:, None, None]
mean_array_2 = np.where((i >= start_index) & (i < end_index), values, 0).sum(0) / (end_index - start_index)
np.allclose(mean_array, mean_array_2)
# True
Note that this assumes that the indices are in the range 0 <= i < values.shape[0]; if this is not the case you can use np.clip or other means to standardize the indices before computation.
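For example, a minimal sketch of that clipping step, assuming out-of-range indices should simply be clamped to the valid range:
# Reusing i = np.arange(values.shape[0])[:, None, None] from above.
# Clamp start/end into [0, values.shape[0]] so the mask and the divisor
# agree; note that if clipping makes start == end anywhere, the divisor
# is zero there and needs separate handling.
start_clipped = np.clip(start_index, 0, values.shape[0])
end_clipped = np.clip(end_index, 0, values.shape[0])
mean_array_3 = (np.where((i >= start_clipped) & (i < end_clipped), values, 0).sum(0)
                / (end_clipped - start_clipped))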

How to transform a Dataframe of xyz coordinates into a binary array of shape (272, 512, 512)

I have a Dataframe that corresponds to a 3D centerline (x,y,z). I want to turn the Dataframe into a binary array with shape (272, 512, 512). The z values from the Dataframe range from about 40-160 and they correspond to the first column in the array. The x and y values correspond to the second and third columns in the array, respectively. Any xyz value not in the Dataframe should correspond to a 0 in the array and any value that is present should correspond to a 1. Any ideas on how to do this considering each plane/slice may have multiple 1's in the array?
I was able to accomplish this if I limited the Dataframe to only have one row per unique z value (one point for each slice) but the real data has multiple rows per unique z value.
Here is what the header of the Dataframe looks like (screenshot not reproduced here).
This is the code that works for downsampled Dataframe (only one row per unique z value):
def dataframe_to_binary_array(df):
    '''
    Takes in a downsampled dataframe and converts it to a 3D binary
    array that is the same shape as the original DICOM stack.
    '''
    empty_array = np.zeros([272, 512, 512], dtype='int64')
    z_column = df['Z']
    for z in z_column:
        z_df = df[z_column == z]
        for k in range(0, 272):
            x = z_df['X']
            y = z_df['Y']
            empty_array[z, x, y] = 1
    return empty_array
Here is my attempt at code for the true Dataframe:
def dataframe_to_binary_array_new(df):
    '''
    Takes in the full dataframe and converts it to a 3D binary array
    that is the same shape as the original DICOM stack.
    '''
    empty_array = np.zeros([272, 512, 512], dtype='int64')
    z_column = df['Z']
    for i in range(0, 272):
        z_df = df[z_column == i]
        for row in z_df:
            x_col = z_df['X'].to_numpy()
            y_col = z_df['Y'].to_numpy()
            for x_element in x_col:
                x = int(x_element)
                for y_element in y_col:
                    y = int(y_element)
                    empty_array[i, x, y] = 1
    return empty_array
The error message I get is "IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices"
I'd come at this a different way: how about iterating over the rows of the original dataframe, then using the coordinates from each row to set the appropriate element of empty_array to 1?
Below is some example code, with empty_array renamed to binary_array. You may need to convert your coordinates from floats to integers to be able to use them as indices into binary_array.
import numpy as np
import pandas as pd

# x, y, z are integers from [0, 10)
n = 10
binary_array = np.zeros([n]*3)

# Build 10 example coordinates
df = pd.DataFrame(np.random.randint(n, size=(10, 3)), columns=list('XYZ'))

for idx, coord in df.iterrows():
    x, y, z = tuple(coord)
    binary_array[x, y, z] = 1
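If the dataframe is large, the row loop can also be avoided entirely: NumPy fancy indexing accepts one integer array per axis, so all of the 1s can be set in a single assignment. A sketch in the question's (z, x, y) layout, assuming df is the full dataframe and its coordinates are integer-valued and in range:
# Set every (z, x, y) coordinate at once; duplicate rows are harmless
# because each one just writes 1 into the same cell again.
zi = df['Z'].to_numpy(dtype=int)
xi = df['X'].to_numpy(dtype=int)
yi = df['Y'].to_numpy(dtype=int)
binary_array = np.zeros((272, 512, 512), dtype='int64')
binary_array[zi, xi, yi] = 1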
As a frame challenge, I'd ask you to consider why you're changing it to a 3D array. Your array would have about 71 million entries. How does that compare to the size of your dataframe?
You're probably not creating a 3D array just to have a 3D array. You have some things that you want to do with it, and you should consider whether those things are really easier to implement with a 3D array. Presumably, you want an object my_array such that my_array.get_value(x, y, z) returns 1 if the tuple (x, y, z) corresponds to a row in the original dataframe, and 0 otherwise. But it's rather simple to create a wrapper around the original dataframe that does that. You could also create a set out of the tuples that appear in each dataframe row, and then simply query the set for inclusion.
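A minimal sketch of that set-based lookup (the helper name get_value is illustrative, and integer coordinates are assumed):
# Build a set of (x, y, z) tuples once; membership tests are then O(1)
coord_set = set(map(tuple, df[['X', 'Y', 'Z']].astype(int).to_numpy()))

def get_value(x, y, z):
    '''Return 1 if (x, y, z) appears as a row in the dataframe, else 0.'''
    return 1 if (x, y, z) in coord_set else 0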

Multidimensional indexing in numpy

Suppose you have a 3d numpy array: how can I compute the mean of the N maximum elements along a given axis? So basically something like:
a = np.random.randint(10, size=(100,100,100)) #axes x,y,z
result = np.mean(a, axis = 2)
however, I want to restrict the mean to the N maximum values along axis z. To illustrate the issue, this is a solution using looping:
a = np.random.randint(10, size=(100, 100, 100))  # axes x, y, z
N = 5
maxima = np.zeros((100, 100, N))  # container for the N max values along axis z
for x in range(100):      # loop through x axis
    for y in range(100):  # loop through y axis
        max_idx = a[x, y, :].argsort()[-N:]  # indices of N max values along z axis
        maxima[x, y, :] = a[x, y, max_idx]   # extract those values
result = np.mean(maxima, axis=2)  # take the mean over the N values
I would like to achieve the same result with multidimensional indexing.
Here's one approach using np.argpartition to get the max N indices and then advanced-indexing for extracting and computing the desired average values -
# Get max N indices along the last axis
maxN_indx = np.argpartition(a,-N, axis=-1)[...,-N:]
# Get a list of indices for use in advanced-indexing into input array,
# alongwith the max N indices along the last axis
all_idx = np.ogrid[tuple(map(slice, a.shape))]
all_idx[-1] = maxN_indx
# Index and get the mean along the last axis
out = a[all_idx].mean(-1)
The last step could also be expressed with explicit advanced indexing, like so -
m,n = a.shape[:2]
out = a[np.arange(m)[:,None,None], np.arange(n)[:,None], maxN_indx].mean(-1)
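A quick way to sanity-check either version is to compare it against a sort-based one-liner, which computes the same quantity more slowly but very transparently:
# Mean of the N largest values along the last axis via a full sort
reference = np.sort(a, axis=-1)[..., -N:].mean(-1)
assert np.allclose(out, reference)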

compute matrix product for multiple inputs

I am trying to compute a transform given by b = A*x, where A is a (3,4) matrix. If x is a single (4,1) vector, the result b is (3,1).
Instead, for x I have a bunch of vectors concatenated into a matrix and I am trying to evaluate the transform for each value of x. So x is (20, 4). How do I broadcast this in numpy such that I get 20 resulting values for b (20,3)?
I could loop over each input and compute the output but it feels like there must be a better way using broadcasting.
Eg.
A = [[1,0,0,0],
[2,0,0,0],
[3,0,0,0]]
if x is:
x = [[1,1,1,1],
[2,2,2,2]]
b = [[1,2,3],
[2,4,6]]
Each row of x is multiplied with A and result is stored as a row in b.
Use numpy's dot:
import numpy as np

A = np.random.normal(size=(3, 4))
x = np.random.normal(size=(4, 20))  # one input vector per column
y = np.dot(A, x)
print(y.shape)
Result: (3, 20)
And of course, if you want the result as (20, 3), you can transpose it with np.transpose() or y.T.
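Alternatively, if you prefer to keep x in the question's (20, 4) row layout, the same transform can be written without transposing afterwards; a small sketch:
import numpy as np

A = np.random.normal(size=(3, 4))
x = np.random.normal(size=(20, 4))  # one input vector per row

# b[i] = A @ x[i] for every row; x @ A.T applies the transform to all rows at once
b = x @ A.T
print(b.shape)  # (20, 3)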
