Masking and histogram of a 3D grid - Python

All,
I have a reasonably large 3D grid as a NumPy array of floats, shape (nx, ny, nz), and a similar (same shape) 3D grid of 1s and 0s, essentially a bitmask. I would like to select data from the grid based on the bitmask and use the selected values later for a histogram.
What I do now is:
k = 0
# droi is preallocated elsewhere to hold the selected values
for iz in range(0, nz):
    for iy in range(0, ny):
        for ix in range(0, nx):
            d = data[ix, iy, iz]
            b = bitmap[ix, iy, iz]
            if b > 0:
                droi[k] = d
                k += 1
hist, bins = np.histogram(droi, bins=200, range=(0.0, dmax))
This is inelegant and slow. I thought about flattening both arrays and multiplying them, then running the histogram on the whole thing, but 0 potentially occurs in data as well, so that would change the histogram.
Any thoughts on how to do it faster and with less code?

If you convert bitmap to an array of booleans, you can use boolean array indexing to get the elements of data corresponding to the True elements in bitmap:
bitmapbool = np.array(bitmap, dtype=bool)
droi = data[bitmapbool]
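Putting it together, a minimal end-to-end sketch (assuming data, bitmap, and dmax are defined as in the question):
import numpy as np

# keep only voxels where the bitmask is nonzero, then histogram them
droi = data[bitmap.astype(bool)]
hist, bins = np.histogram(droi, bins=200, range=(0.0, dmax))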


Efficient way to map 3D function to a meshgrid with NumPy

I have a set of data values for a scalar 3D function, arranged as inputs x,y,z in an array of shape (n,3) and the function values f(x,y,z) in an array of shape (n,).
EDIT: For instance, consider the following simple function
n = 10  # example size
data = np.array([np.arange(n)]*3).T
F = np.linalg.norm(data, axis=1)**2
I would like to convolve this function with a spherical kernel in order to perform a 3D smoothing. The easiest way I found to perform this is to map the function values onto a 3D spatial grid and then apply a 3D convolution with the kernel I want.
This works fine, however the part that maps the 3D function to the 3D grid is very slow, as I did not find a way to do it with NumPy only. The code below is my actual implementation, where data is the (n,3) array containing the 3D positions in which the function is evaluated, F is the (n,) array containing the corresponding values of the function and M is the (N,N,N) array that contains the 3D space grid.
step = 0.1
# Create meshgrid
xmin = data[:,0].min()
xmax = data[:,0].max()
ymin = data[:,1].min()
ymax = data[:,1].max()
zmin = data[:,2].min()
zmax = data[:,2].max()
x = np.linspace(xmin,xmax,int((xmax-xmin)/step)+1)
y = np.linspace(ymin,ymax,int((ymax-ymin)/step)+1)
z = np.linspace(zmin,zmax,int((zmax-zmin)/step)+1)
# Build image
M = np.zeros((len(x),len(y),len(z)))
for l in range(len(data)):
    for i in range(len(x)-1):
        if x[i] < data[l,0] < x[i+1]:
            for j in range(len(y)-1):
                if y[j] < data[l,1] < y[j+1]:
                    for k in range(len(z)-1):
                        if z[k] < data[l,2] < z[k+1]:
                            M[i,j,k] = F[l]
Is there a more efficient way to fill a 3D spatial grid with the values of a 3D function?
For each item of data you're scanning the pixels of the cuboid to check whether it lies inside. You can skip this scan entirely by computing the corresponding indices directly, for example:
data = np.array([[1, 2, 3],        # 14 (corner1)
                 [4, 5, 6],        # 77 (corner2)
                 [2.5, 3.5, 4.5],  # 38.75 (duplicated pixel)
                 [2.9, 3.9, 4.9],  # 47.63 (duplicated pixel)
                 [1.5, 2, 3]])     # 15.25 (one step up from [1, 2, 3])
step = 0.5
data_idx = ((data - data.min(axis=0))//step).astype(int)
M = np.zeros(np.max(data_idx, axis=0) + 1)
x, y, z = data_idx.T
M[x, y, z] = F
Note that only one value of duplicated pixels is being mapped to M.
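Once M is filled, the smoothing the question asks about can be done with a single 3D convolution. A minimal sketch, assuming SciPy is available and using a hard spherical kernel of radius kr grid cells (kr is an illustrative choice, not from the question):
import numpy as np
from scipy.ndimage import convolve

kr = 3  # kernel radius in grid cells (assumption for illustration)
ax = np.arange(-kr, kr + 1)
kx, ky, kz = np.meshgrid(ax, ax, ax, indexing='ij')
kernel = (kx**2 + ky**2 + kz**2 <= kr**2).astype(float)
kernel /= kernel.sum()  # normalize so the smoothing preserves the mean
M_smooth = convolve(M, kernel, mode='constant', cval=0.0)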
If the points already lie on a regular N x N x N grid, all you need is to reshape the function values F into that grid. Hard to be more precise without sample data.
If the data is not sorted, sort it first (note that np.lexsort uses its last key as the primary sort key):
order = np.lexsort((data[:, 2], data[:, 1], data[:, 0]))  # sort by x, then y, then z
Finally, reshape the sorted values into a grid:
M = F[order].reshape(N, N, N)
This method is faster than the original (approximately a 20x speed-up):
step = 0.1
mins = np.min(data, axis=0)
maxs = np.max(data, axis=0)
ranges = np.floor((maxs - mins) / step + 1).astype(int)
indx = np.zeros(data.shape, dtype=int)
for i in range(3):
    x = np.linspace(mins[i], maxs[i], ranges[i])
    indx[:, i] = np.argmax(data[:, i, np.newaxis] <= x[np.newaxis, :], axis=1) - 1
M = np.zeros(ranges)
M[indx[:, 0], indx[:, 1], indx[:, 2]] = F
The first part sets up the required grid variables. The argmax function provides a simple (and fast) way to find the first true value of the broadcasted array. This produces a set of indices for x, y and z directions for each of the function values.
The resulting array M is not the same as that produced by the original code, because the original code loses data. The logic of y[j] < data[l,1] < y[j+1], where y is a vector produced using linspace, means the minimum and maximum values in each direction are missed (data[l,1] might be exactly equal to y[j] or y[j+1]!). Run it with a dataset of two values, each with their own coordinates, and the M array will be all zeros.
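As an aside, np.searchsorted performs the same left-edge lookup directly and also keeps points that fall exactly on a grid line; a sketch continuing from the mins, maxs, and ranges defined above:
indx = np.empty(data.shape, dtype=int)
for i in range(3):
    x = np.linspace(mins[i], maxs[i], ranges[i])
    # index of the last grid edge <= the coordinate
    indx[:, i] = np.searchsorted(x, data[:, i], side='right') - 1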

Fill values into numpy array that depend non-trivially on indices

The problem: I'm trying to fill a 2D array arr with values where the values depend on the indices (i, j) in some nontrivial way. More precisely, i and j together provide a new index k (i, j, and k all have the same range), which I then use to look up a value in some other array (i.e., arr[i, j] = values[k]).
My initial thought was that np.put_along_axis could be used for this. I generated two lists indices and values, such that
nrows, ncols = arr.shape
for i in range(nrows):
    arr[i, indices[i]] = values[i]
In principle this works fine, but when I try
np.put_along_axis(arr, indices, values, axis=1)
I get the following error
AttributeError: 'list' object has no attribute 'dtype'
However, I can't make these lists into arrays because they're ragged; some rows have fewer values that need insertion than others. I am wondering if there is a way to use np.put_along_axis?
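One workaround for the ragged shapes (sidestepping put_along_axis entirely) is to flatten the per-row lists into one pair of index arrays and use plain fancy indexing; a minimal sketch, assuming indices and values are lists of per-row arrays as in the loop above:
import numpy as np

rows = np.concatenate([np.full(len(ix), i) for i, ix in enumerate(indices)])
cols = np.concatenate(indices)
vals = np.concatenate(values)
arr[rows, cols] = vals  # same effect as the per-row loop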
In short, you probably want to use np.indices.
Since you didn't give an example, I will use indices to calculate polar coordinates and look them up in another picture.
First, I create a picture to look up the values from later:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
n = 100
func = lambda i,j: np.linalg.norm(np.array([i-n/2,j-n/2]), axis=0)
arr = np.fromfunction(func, (n,n), dtype='int')
arr = (arr < np.median(arr)).astype('int')
plt.imshow(arr, cmap='gray')
Now I calculate polar coordinates on the above picture. In case you need a refresher on your calculus: this means we identify points by their distance from a center and an angle. I.e., if you go left/right in the lower picture, you go in a circle (counterclockwise/clockwise) in the upper one, and going up and down means moving toward or away from the center. In polar coordinates the disk should more or less turn into a rectangle.
r,phi = np.indices(arr.shape, dtype='float')
r *= 50/100
phi *= 2*np.pi/100
def polar2cartesian(r, phi):
    x = r * np.cos(phi)
    y = r * np.sin(phi)
    return (x, y)
i,j = polar2cartesian(r, phi)
i = (i+50).astype('int')
j = (j+50).astype('int')
out = np.zeros(arr.shape)
out = arr[i,j]
plt.imshow(out, cmap='gray')
plt.xlabel('phi (0 to 2pi)')
plt.ylabel('r (0 to 50)')
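As an aside, scipy.ndimage.map_coordinates does the same lookup with interpolation instead of truncating indices to int; a sketch reusing r, phi, polar2cartesian, and arr from above:
from scipy.ndimage import map_coordinates

i, j = polar2cartesian(r, phi)
out = map_coordinates(arr, [i + 50, j + 50], order=1)  # bilinear interpolation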

Correlate a 1d array with every timeseries in a 4d array in Python in a short time

I have a 4D array (called a) with shape (35, 2000, 60, 180) that I need to correlate with a 1D array (called b) that has a length of 2000, while detrending and smoothing both arrays.
I managed to use a nested for-loop to correlate the 1D array with a 3D array (called c) of shape (x, y, z) by iterating through each point (y, z), detrending c[x, y, :], and storing the correlation coefficient between b and the timeseries at that point.
However, using a triply-nested for-loop to calculate the correlation with a 4D array takes too much time to compute. Is there a more efficient way to produce an array that contains the correlation coefficient between each timeseries in a 4D array and a 1D array?
Here is my code for calculating the correlation with only 3 dimensions involved. It takes around a minute to execute on an array with shape (2000, 60, 180).
Also, the larger array has NaNs, in which case I set the correlation for the entire (x, y) point to be NaN.
import numpy as np
import pandas as pd
from scipy import signal, stats

def correlation_detrended(cs, ts, smooth=360):
    cs_det = cs
    ts_det = ts
    # detrend the non-NaN part of the timeseries; assign the result back,
    # since detrending a fancy-indexed copy in place would be lost
    valid = ~np.isnan(ts_det)
    ts_det[valid] = signal.detrend(ts_det[valid])
    ts_det = pd.DataFrame(ts_det).rolling(smooth, center=True).mean().to_numpy()[:, 0]
    # output map of correlation coefficients (was undefined in the original snippet)
    correlation = np.empty(cs_det.shape[1:])
    for i in range(len(cs_det[0, :, 0])):
        for j in range(len(cs_det[0, i, :])):
            if np.any(np.isnan(cs_det[:, i, j])):
                r, p = (np.nan, np.nan)
            else:
                cs_det[:, i, j] = signal.detrend(cs_det[:, i, j])
                cs_det[:, i, j] = pd.DataFrame(cs_det[:, i, j]).rolling(smooth, center=True).mean().to_numpy()[:, 0]
                offset = int(smooth/2 + 120)
                r, p = stats.pearsonr(cs_det[offset:-offset, i, j], ts_det[offset:-offset])
            correlation[i, j] = r
    return correlation
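For the vectorization itself: the per-point stats.pearsonr calls can be replaced by one array operation, since Pearson r is the average product of z-scored series. A minimal sketch (detrending, smoothing, and NaN handling omitted), assuming c has shape (nt, ny, nx) and b has shape (nt,):
import numpy as np

def pearson_map(c, b):
    # z-score along the time axis (axis 0), using the population std
    cz = (c - c.mean(axis=0)) / c.std(axis=0)
    bz = (b - b.mean()) / b.std()
    # mean of the products of z-scores = Pearson r at each gridpoint
    return np.einsum('t...,t->...', cz, bz) / len(b)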

Efficiently select elements from an (x,y) field with a 2D mask in Python

I have a large field of 2D-position data, given as two arrays x and y, where len(x) == len(y). I would like to return the array of indices idx_masked at which (x[idx_masked], y[idx_masked]) is masked by an N x N int array called mask. That is, mask[x[idx_masked], y[idx_masked]] == 1. The mask array consists of 0s and 1s only.
I have come up with the following solution, but it (specifically, the last line below) is very slow, given that I have N x N = 5000 x 5000, repeated 1000s of times:
import numpy as np
import matplotlib.pyplot as plt
# example mask of one corner of a square
N = 100
mask = np.zeros((N, N))
mask[0:10, 0:10] = 1
# example x and y position arrays in arbitrary units
x = np.random.uniform(0, 1, 1000)
y = np.random.uniform(0, 1, 1000)
x_bins = np.linspace(np.min(x), np.max(x), N)
y_bins = np.linspace(np.min(y), np.max(y), N)
x_bin_idx = np.digitize(x, x_bins)
y_bin_idx = np.digitize(y, y_bins)
idx_masked = np.ravel(np.where(mask[y_bin_idx - 1, x_bin_idx - 1] == 1))
plt.imshow(mask[::-1, :])
plt.scatter(x, y, color='red')
plt.scatter(x[idx_masked], y[idx_masked], color='blue')
Is there a more efficient way of doing this?
Given that mask overlays your field with identically-sized bins, you do not need to define the bins explicitly. *_bin_idx can be determined at each location from a simple floor division, since you know that each bin is 1 / N in size. I would recommend using 1 - 0 for the total width (what you passed into np.random.uniform) instead of x.max() - x.min(), if of course you know the expected size of the range.
x0 = 0 # or x.min()
x1 = 1 # or x.max()
x_bin = (x1 - x0) / N
x_bin_idx = ((x - x0) // x_bin).astype(int)
# ditto for y
This will be faster and simpler than digitizing, and avoids the extra bin at the beginning.
For most purposes, you do not need np.where. 90% of the questions asking about it (including this one) should not be using where. If you want a fast way to access the necessary elements of x and y, just use a boolean mask. The mask is simply
selection = mask[x_bin_idx, y_bin_idx].astype(bool)
If mask is already a boolean (which it should be anyway), the expression mask[x_bin_idx, y_bin_idx] is sufficient. It results in an array of the same size as x_bin_idx and y_bin_idx (which are the same size as x and y) containing the mask value for each of your points. You can use the mask as
x[selection] # Elements of x in mask
y[selection] # Elements of y in mask
If you absolutely need the integer indices, where is still not your best option.
indices = np.flatnonzero(selection)
OR
indices = selection.nonzero()[0]
If your goal is simply to extract values from x and y, I would recommend stacking them together into a single array:
coords = np.stack((x, y), axis=1)
This way, instead of having to apply indices twice, you can extract the values with just
coords[selection, :]
OR
coords[indices, :]
Depending on the relative densities of mask and x and y, either the boolean masking or linear indexing may be faster. You will have to time some relevant cases to get a better intuition.
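Putting the pieces together, a minimal end-to-end sketch under the question's setup (unit-square field, N x N boolean mask; rows indexed by y to match the question's plotting convention):
import numpy as np

N = 100
mask = np.zeros((N, N), dtype=bool)
mask[0:10, 0:10] = True

x = np.random.uniform(0, 1, 1000)
y = np.random.uniform(0, 1, 1000)

# bin width is (1 - 0) / N, so binning reduces to a floor division
x_bin_idx = np.clip((x * N).astype(int), 0, N - 1)
y_bin_idx = np.clip((y * N).astype(int), 0, N - 1)

selection = mask[y_bin_idx, x_bin_idx]
x_in, y_in = x[selection], y[selection]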

NumPy 2D array: selecting indices in a circle

For a rectangular region we can select all indices in a 2D array very efficiently:
arr[y:y+height, x:x+width]
...where (x, y) is the upper-left corner of the rectangle, and height and width are the number of rows and columns of the rectangular selection.
Now, let's say we want to select all indices in a 2D array located in a certain circle given center coordinates (cx, cy) and radius r. Is there a numpy function to achieve this efficiently?
Currently I am pre-computing the indices manually with a Python loop that adds indices into a buffer (list). This is pretty inefficient for large 2D arrays, since I need to queue up every integer point lying in the circle.
from itertools import product

# buffer for x & y indices
indices_x = list()
indices_y = list()
# lower and upper index range
x_lower, x_upper = int(max(cx-r, 0)), int(min(cx+r, arr.shape[1]-1))
y_lower, y_upper = int(max(cy-r, 0)), int(min(cy+r, arr.shape[0]-1))
range_x = range(x_lower, x_upper)
range_y = range(y_lower, y_upper)
# loop over all indices
for y, x in product(range_y, range_x):
    # check if point lies within radius r
    if (x-cx)**2 + (y-cy)**2 < r**2:
        indices_y.append(y)
        indices_x.append(x)
# circle indexing
arr[(indices_y, indices_x)]
As mentioned, this procedure gets quite inefficient for larger arrays / circles. Any ideas for speeding things up?
If there is a better way to index a circle, does this also apply for "arbitrary" 2D shapes? For example, could I somehow pass a function that expresses membership of points for an arbitrary shape to get the corresponding numpy indices of an array?
You could define a mask that contains the circle. Below, I have demonstrated it for a circle, but you could write any arbitrary function in the mask assignment (a generic sketch follows the code). The field mask has the dimensions of arr and is True wherever the condition on the right-hand side is satisfied, and False otherwise. This mask can be used in combination with the indexing operator to assign to only a selection of indices, as the line arr[mask] = 123. demonstrates.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 32)
y = np.arange(0, 32)
arr = np.zeros((y.size, x.size))
cx = 12.
cy = 16.
r = 5.
# The two lines below could be merged, but I stored the mask
# for code clarity.
mask = (x[np.newaxis,:]-cx)**2 + (y[:,np.newaxis]-cy)**2 < r**2
arr[mask] = 123.
# This plot shows that only within the circle the value is set to 123.
plt.figure(figsize=(6, 6))
plt.pcolormesh(x, y, arr)
plt.colorbar()
plt.show()
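To address the follow-up about arbitrary shapes: the same pattern works with any vectorized membership function. A sketch of a generic helper (shape_mask is a name invented here for illustration):
import numpy as np

def shape_mask(shape, predicate):
    # Boolean mask of all (row, col) indices where predicate(y, x) is True.
    # predicate must accept broadcastable index arrays and return booleans.
    yy, xx = np.ogrid[0:shape[0], 0:shape[1]]
    return predicate(yy, xx)

# example: the same circle as above
mask = shape_mask((32, 32), lambda y, x: (x - 12.)**2 + (y - 16.)**2 < 5.**2)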
Thank you Chiel for your answer, but I couldn't see radius 5 in the output (the diameter is 9 in the output, not 10).
One can subtract 0.5 from cx and cy to produce a diameter of 10:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 32)
y = np.arange(0, 32)
arr = np.zeros((y.size, x.size))
cx = 12.-.5
cy = 16.-.5
r = 5.
# The two lines below could be merged, but I stored the mask
# for code clarity.
mask = (x[np.newaxis,:]-cx)**2 + (y[:,np.newaxis]-cy)**2 < r**2
arr[mask] = 123.
# This plot shows that only within the circle the value is set to 123.
plt.figure(figsize=(6, 6))
plt.pcolormesh(x, y, arr)
plt.colorbar()
plt.show()
