Arrange and Sub-sample 3d Points in Coordinate Grid With Numpy - python

I have a list of 3D points such as
np.array([
    [220, 114, 2000],
    [125.24, 214, 2519],
    ...
    [54.1, 254, 1249]
])
The points are in no meaningful order. I'd like to sort and reshape the array so that it better represents a coordinate grid (such that I have a known width and height and can retrieve Z values by index). I would also like to down-sample the points, say to whole integers, and handle collisions by applying min, max, or mean during the down-sampling.
I know I can down-sample a 1D array by reshaping it and applying np.mean.
The approach I'm currently using finds the min and max in X and Y and then puts the Z values into a 2D array while doing the down-sampling manually.
This iterates over the giant array numerous times, and I'm wondering whether there is a way to do this with np.meshgrid or some other numpy functionality that I'm overlooking.
Thanks

You can use the binning method from "Most efficient way to sort an array into bins specified by an index array?"
To get an index array from the y, x coordinates you can use np.searchsorted and np.ravel_multi_index.
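As a tiny, self-contained illustration of that indexing step (the numbers and grid shape here are made up for the example):
import numpy as np
y = np.array([0.1, 2.5, 1.7])
x = np.array([3.2, 0.4, 2.9])
yg = np.linspace(0, 3, 3, endpoint=False)    # left edges of the y bins: [0, 1, 2]
xg = np.linspace(0, 4, 4, endpoint=False)    # left edges of the x bins: [0, 1, 2, 3]
yidx = yg.searchsorted(y, 'right') - 1       # row index of each point
xidx = xg.searchsorted(x, 'right') - 1       # column index of each point
flat = np.ravel_multi_index((yidx, xidx), (3, 4))
print(flat)                                  # [3 8 6] -- one flat bin id per point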
Here is a sample implementation; the stb module is the code from the linked post.
import numpy as np
from stb import sort_to_bins_sparse as sort_to_bins

def grid1D(u, N):
    mn, mx = u.min(), u.max()
    return np.linspace(mn, mx, N, endpoint=False)

def gridify(yxz, N):
    try:
        Ny, Nx = N
    except TypeError:
        Ny = Nx = N
    y, x, z = yxz.T
    yg, xg = grid1D(y, Ny), grid1D(x, Nx)
    yidx, xidx = yg.searchsorted(y, 'right') - 1, xg.searchsorted(x, 'right') - 1
    yx = np.ravel_multi_index((yidx, xidx), (Ny, Nx))
    zs = sort_to_bins(yx, z)
    return np.concatenate([[0], np.bincount(yx).cumsum()]), zs, yg, xg

def bin(yxz, N, binning_method='min'):
    boundaries, binned, yg, xg = gridify(yxz, N)
    result = np.full((yg.size, xg.size), np.nan)
    if binning_method == 'min':
        result.reshape(-1)[:len(boundaries)-1] = np.minimum.reduceat(binned, boundaries[:-1])
    elif binning_method == 'max':
        result.reshape(-1)[:len(boundaries)-1] = np.maximum.reduceat(binned, boundaries[:-1])
    elif binning_method == 'mean':
        result.reshape(-1)[:len(boundaries)-1] = np.add.reduceat(binned, boundaries[:-1]) / np.diff(boundaries)
    else:
        raise ValueError
    result.reshape(-1)[np.where(boundaries[1:] == boundaries[:-1])] = np.nan
    return result

def test():
    yxz = np.random.uniform(0, 100, (100000, 3))
    N = 20
    boundaries, binned, yg, xg = gridify(yxz, N)
    binmin = bin(yxz, N)
    binmean = bin(yxz, N, 'mean')
    y, x, z = yxz.T
    for i in range(N-1):
        for j in range(N-1):
            msk = (y>=yg[i]) & (y<yg[i+1]) & (x>=xg[j]) & (x<xg[j+1])
            assert (z[msk].min() == binmin[i, j]) if msk.any() else np.isnan(binmin[i, j])
            assert np.isclose(z[msk].mean(), binmean[i, j]) if msk.any() else np.isnan(binmean[i, j])
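For completeness, if a SciPy dependency is acceptable, scipy.stats.binned_statistic_2d implements the same grid-and-aggregate idea directly and is an alternative to the code above; a minimal sketch (the grid size and data are illustrative):
import numpy as np
from scipy.stats import binned_statistic_2d

yxz = np.random.uniform(0, 100, (100000, 3))
y, x, z = yxz.T
# Aggregate colliding points with the mean on a 20x20 grid; 'min' and 'max' are also supported
stat, y_edges, x_edges, _ = binned_statistic_2d(y, x, z, statistic='mean', bins=20)
print(stat.shape)    # (20, 20); cells that received no points are NaN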

Related

Calculate derivative of spatial measurements

I have a set of spatially distributed measurements.
For each point p1 = [x1,y1,z1] there is a measurement v1, which is a scalar (e.g. temperature measurements under water).
Let's assume these measurements are on a regular grid.
I would like to find out where the most variation is in this distribution.
That means: at which positions is there the most change of temperature?
I think this corresponds to the spatial derivative of the temperature.
Can somebody give me advice on how to proceed?
What are methodologies to achieve this?
I tried to implement it with np.gradient() but I fail at interpreting the result...
This is absolutely not optimized code, but here is what I came up with, at least to explain how it works.
grid = [[[1, 2], [2, 3]], [[8, 5], [4, 1000]]]

def get_greatest_diff(g, x, y, z):
    value = g[x][y][z]
    try:
        diff_x = abs(value - g[x+1][y][z])
    except IndexError:
        diff_x = -1
    try:
        diff_y = abs(value - g[x][y+1][z])
    except IndexError:
        diff_y = -1
    try:
        diff_z = abs(value - g[x][y][z+1])
    except IndexError:
        diff_z = -1
    if diff_x >= diff_y and diff_x >= diff_z:
        return diff_x, [x+1, y, z]
    if diff_y > diff_x and diff_y >= diff_z:
        return diff_y, [x, y+1, z]
    return diff_z, [x, y, z+1]

greatest_diff = 0
greatest_diff_pos0 = []
greatest_diff_pos1 = []
for x in range(len(grid)):
    for y in range(len(grid[x])):
        for z in range(len(grid[x][y])):
            diff, coords = get_greatest_diff(grid, x, y, z)
            if diff > greatest_diff:
                greatest_diff = diff
                greatest_diff_pos0 = [x, y, z]
                greatest_diff_pos1 = coords
print(greatest_diff, greatest_diff_pos0, greatest_diff_pos1)
The try:...except:... blocks are here to handle the edge conditions. (That's dirty, but it's quick!)
For each cell, you look at the three neighbours (x+1, y+1, or z+1) and compute the difference with their values. You keep the largest difference in the neighbourhood and return it. (That is what get_greatest_diff does.)
In the main loop, you check whether the difference in this neighbourhood is the greatest of all; if so, you store the difference and the two cells in question.
Finally, you print the greatest difference and the cells in question.
Here is a numpy solution that returns the indices of the element in an ndarray that has the biggest total difference with its neighbors.
Say the input array is X and it is 2D. I will create D, where D[i, j] = |X[i, j] - X[i-1, j]| + |X[i, j] - X[i, j-1]|, and return the indices that give the largest value in D.
import numpy as np

def greatest_diff(X):
    ndim = X.ndim
    Ds = [np.abs(np.diff(X, axis=i, prepend=0)) for i in range(ndim)]
    D = sum(Ds)
    return np.unravel_index(D.argmax(), D.shape)

X = np.zeros((5, 5))
X[2, 2] = 1
greatest_diff(X)
# returns (2, 2)

X = np.zeros((5, 10, 9))
X[2, 2, 7] = -1
greatest_diff(X)
# returns (2, 2, 7)
Another solution might be calculating the difference between X[i, j] and sum(X[k, l]), where k, l are the neighbors of i, j. You can achieve this by applying a Gaussian filter to X, say gX, and then taking the squared differences: (X - gX)^2.
def greatest_diff_gaussian(X, sigma=1):
    from scipy.ndimage import gaussian_filter
    gX = gaussian_filter(X, sigma)
    dgX = np.power(X - gX, 2)
    return np.unravel_index(dgX.argmax(), dgX.shape)
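Since the question specifically mentions np.gradient, here is a minimal sketch of how to interpret its output for this purpose; it assumes a regular grid with unit spacing and uses a made-up temperature field:
import numpy as np

T = np.random.rand(5, 10, 9)                      # hypothetical 3D temperature field
dTdx, dTdy, dTdz = np.gradient(T)                 # one partial derivative per axis
grad_mag = np.sqrt(dTdx**2 + dTdy**2 + dTdz**2)   # magnitude of the spatial gradient
print(np.unravel_index(grad_mag.argmax(), grad_mag.shape))  # position of strongest variation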

More pythonic way of creating within-class scatter matrix

I am looking for a better way of calculating the following
import numpy as np
np.random.seed(123)

# test code
t = np.random.randint(3, size=100)
X = np.random.random((100, 3))
m = np.random.random((3, 3))

# current method
res = 0
for k in np.unique(t):
    for row in X[t == k] - m[k]:
        res += np.outer(row, row)
res
"""
Output:
array([[12.45661335, -3.51124346,  3.75900294],
       [-3.51124346, 14.85327689, -3.02281263],
       [ 3.75900294, -3.02281263, 18.30868772]])
"""
I would prefer getting rid of the for loops using numpy.
This is the within-class scatter matrix for Fisher's linear discriminant.
You can write it as follows:
Y = X - m[t]
np.matmul(Y.T, Y)
This is because sum_i x_i x_i' = X'X, where X is an (N, 3) matrix and x_i = X[i, :], i.e. the i-th row of X; ' denotes the transpose.
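A quick sanity check, reusing t, X, m, and res from the question:
Y = X - m[t]                                 # subtract each sample's class mean
print(np.allclose(res, np.matmul(Y.T, Y)))   # True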

Implementation of a threshold detection function in Python

I want to implement the following trigger function in Python:
Input:
time vector t [n-dimensional numpy vector]
data vector y [n-dimensional numpy vector] (values correspond to the t vector)
threshold tr [float]
Threshold type vector tr_type [m-dimensional list of int values]
Output:
Threshold time vector tr_time [m-dimensional list of float values]
Function:
I would like to return tr_time, which consists of the exact (preferably also interpolated, which is not yet in the code below) time values at which y crosses tr (crossing means going from less than to greater than, or the other way around). The different values in tr_time correspond to the tr_type vector: the elements of tr_type indicate the number of the crossing and whether it is an upgoing or a downgoing crossing. For example, 1 means the first time y goes from less than tr to greater than tr, and -3 means the third time y goes from greater than tr to less than tr (third time along the time vector t).
For the moment I have the following code:
import numpy as np
import matplotlib.pyplot as plt

def trigger(t, y, tr, tr_type):
    triggermarker = np.diff(1 * (y > tr))
    positiveindices = [i for i, x in enumerate(triggermarker) if x == 1]
    negativeindices = [i for i, x in enumerate(triggermarker) if x == -1]
    triggertime = []
    for i in tr_type:
        if i >= 0:
            triggertime.append(t[positiveindices[i - 1]])     # i-th upward crossing
        elif i < 0:
            triggertime.append(t[negativeindices[-i - 1]])    # |i|-th downward crossing
    return triggertime
t = np.linspace(0, 20, 1000)
y = np.sin(t)
tr = 0.5
tr_type = [1, 2, -2]
print(trigger(t, y, tr, tr_type))
plt.plot(t, y)
plt.grid()
Now I'm pretty new to Python, so I was wondering if there is a more Pythonic and more efficient way to implement this. For example, without for loops, or without the need to write separate code for upgoing and downgoing crossings.
You can use two masks: the first separates the values below and above the threshold, the second applies np.diff to the first mask: if the values at i and i+1 are both below or both above the threshold, np.diff yields 0:
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 8 * np.pi, 400)
y = np.sin(t)
th = 0.5
mask = np.diff(1 * (y > th)) != 0
plt.plot(t, y, 'bx', markersize=3)
plt.plot(t[:-1][mask], y[:-1][mask], 'go', markersize=8)
Using the slice [:-1] will yield the index "immediately before" crossing the threshold (you can see that in the chart). If you want the index "immediately after", use [1:] instead of [:-1].
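The question also asks for interpolated crossing times. A rough sketch that builds on the mask above (it reuses t, y, th, and mask, and linearly interpolates between the samples just before and just after each crossing):
idx = np.flatnonzero(mask)                        # sample index just before each crossing
frac = (th - y[idx]) / (y[idx + 1] - y[idx])      # fractional position of the crossing
crossing_times = t[idx] + frac * (t[idx + 1] - t[idx])
print(crossing_times)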

Numpy find covariance of two 2-dimensional ndarray

I am new to numpy and am stuck on this problem.
I have two 2-dimensional numpy arrays such as
x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))
I want to use the numpy cov function to find the covariance of these two ndarrays row-wise, i.e. for the above example the output array should consist of 10 elements, each denoting the covariance of the corresponding rows of the ndarrays. I know I can do this by traversing the rows and finding the covariance of two 1D arrays, but it isn't pythonic.
Edit 1: By the covariance of two arrays I mean the element at index [0, 1] of the covariance matrix.
Edit 2: Currently this is my implementation:
s = numpy.empty((x.shape[0], 1))
for i in range(x.shape[0]):
    s[i] = numpy.cov(x[i], y[i])[0][1]
Use the definition of the covariance: E(XY) - E(X)E(Y).
import numpy as np
x = np.random.random((10, 5))
y = np.random.random((10, 5))
n = x.shape[1]
cov_bias = np.mean(x * y, axis=1) - np.mean(x, axis=1) * np.mean(y, axis=1)
cov_bias * n / (n - 1)
Note that cov_bias corresponds to the result of numpy.cov(bias=True).
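A quick check of this identity against the loop from the question (reusing x, y, n, and cov_bias from above):
s = np.array([np.cov(xi, yi)[0, 1] for xi, yi in zip(x, y)])
print(np.allclose(cov_bias * n / (n - 1), s))   # True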
Here's one using the definition of covariance and inspired by corr2_coeff_rowwise -
import numpy as np

def covariance_rowwise(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(-1, keepdims=True)
    B_mB = B - B.mean(-1, keepdims=True)
    # Finally get covariance
    N = A.shape[1]
    return np.einsum('ij,ij->i', A_mA, B_mB) / (N - 1)
Sample run -
In [66]: np.random.seed(0)
    ...: x = np.random.random((10, 5))
    ...: y = np.random.random((10, 5))
In [67]: s = np.empty((x.shape[0]))
    ...: for i in range(x.shape[0]):
    ...:     s[i] = np.cov(x[i], y[i])[0][1]
In [68]: np.allclose(covariance_rowwise(x, y), s)
Out[68]: True
This works, but I'm not sure whether it is faster for larger matrices x and y: the call numpy.cov(x, y) computes many entries that we then discard with numpy.diag:
import numpy

x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))
# with loop
for (xi, yi) in zip(x, y):
    print(numpy.cov(xi, yi)[0][1])
# vectorized
cov_mat = numpy.cov(x, y)
covariances = numpy.diag(cov_mat, x.shape[0])
print(covariances)
I also did some timing for square matrices of size n x n:
import time
import numpy

def run(n):
    x = numpy.random.random((n, n))
    y = numpy.random.random((n, n))
    started = time.time()
    for (xi, yi) in zip(x, y):
        numpy.cov(xi, yi)[0][1]
    needed_loop = time.time() - started
    started = time.time()
    cov_mat = numpy.cov(x, y)
    covariances = numpy.diag(cov_mat, x.shape[0])
    needed_vectorized = time.time() - started
    print(
        f"n={n:4d} needed_loop={needed_loop:.3f} s "
        f"needed_vectorized={needed_vectorized:.3f} s"
    )

for n in (100, 200, 500, 600, 700, 1000, 2000, 3000):
    run(n)
The output on my slow MacBook Air is:
n= 100 needed_loop=0.006 s needed_vectorized=0.001 s
n= 200 needed_loop=0.011 s needed_vectorized=0.003 s
n= 500 needed_loop=0.033 s needed_vectorized=0.023 s
n= 600 needed_loop=0.041 s needed_vectorized=0.039 s
n= 700 needed_loop=0.043 s needed_vectorized=0.049 s
n=1000 needed_loop=0.061 s needed_vectorized=0.130 s
n=2000 needed_loop=0.137 s needed_vectorized=0.742 s
n=3000 needed_loop=0.224 s needed_vectorized=2.264 s
So the break-even point is around n = 600.
Pick the diagonal vector of cov(x,y) and expand dims:
numpy.expand_dims(numpy.diag(numpy.cov(x,y),x.shape[0]),1)

Scipy Fast 1-D interpolation without any loop

I have two 2D arrays, x(ni, nj) and y(ni, nj), that I need to interpolate over one axis. I want to interpolate along the last axis for every ni.
I wrote
import numpy as np
from scipy.interpolate import interp1d

z = np.asarray([200, 300, 400, 500, 600])
out = []
for i in range(ni):
    f = interp1d(x[i, :], y[i, :], kind='linear')
    out.append(f(z))
out = np.asarray(out)
However, I think this method is inefficient and slow, due to the loop, if the array size is too large. What is the fastest way to interpolate a multi-dimensional array like this? Is there any way to perform linear and cubic interpolation without a loop? Thanks.
The method you propose does have a python loop, so for large values of ni it is going to get slow. That said, unless you are going to have large ni you shouldn't worry much.
I have created sample input data with the following code:
def sample_data(n_i, n_j, z_shape):
    x = np.random.rand(n_i, n_j) * 1000
    x.sort()
    x[:, 0] = 0
    x[:, -1] = 1000
    y = np.random.rand(n_i, n_j)
    z = np.random.rand(*z_shape) * 1000
    return x, y, z
And I have tested it with these two versions of linear interpolation:
def interp_1(x, y, z):
    rows, cols = x.shape
    out = np.empty((rows,) + z.shape, dtype=y.dtype)
    for j in range(rows):
        out[j] = interp1d(x[j], y[j], kind='linear', copy=False)(z)
    return out

def interp_2(x, y, z):
    rows, cols = x.shape
    row_idx = np.arange(rows).reshape((rows,) + (1,) * z.ndim)
    col_idx = np.argmax(x.reshape(x.shape + (1,) * z.ndim) > z, axis=1) - 1
    ret = y[row_idx, col_idx + 1] - y[row_idx, col_idx]
    ret /= x[row_idx, col_idx + 1] - x[row_idx, col_idx]
    ret *= z - x[row_idx, col_idx]
    ret += y[row_idx, col_idx]
    return ret
interp_1 is an optimized version of your code, following Dave's answer. interp_2 is a vectorized implementation of linear interpolation that avoids any Python loop whatsoever. Coding something like this requires a sound understanding of broadcasting and indexing in numpy, and some things are going to be less optimized than what interp1d does. A prime example is finding the bin in which to interpolate a value: interp1d will surely break out of its loop early once it finds the bin, while the function above compares the value to all bins.
So the result is going to depend heavily on what n_i and n_j are, and even on how long your array z of values to interpolate is. If n_j is small and n_i is large, you should expect an advantage from interp_2, and an advantage from interp_1 if it is the other way around. Smaller z should favour interp_2, longer ones interp_1.
I have actually timed both approaches with a variety of n_i and n_j, for z of shape (5,) and (50,); here are the graphs:
So it seems that for z of shape (5,) you should go with interp_2 whenever n_j < 1000, and with interp_1 elsewhere. Not surprisingly, the threshold is different for z of shape (50,), now being around n_j < 100. It seems tempting to conclude that you should stick with your code if n_j * len(z) > 5000 and change to something like interp_2 above if not, but there is a great deal of extrapolation in that statement! If you want to experiment further yourself, here's the code I used to produce the graphs.
import timeit
import numpy as np
import matplotlib.pyplot as plt

n_s = np.logspace(1, 3.3, 25)
int_1 = np.empty((len(n_s),) * 2)
int_2 = np.empty((len(n_s),) * 2)
z_shape = (5,)
for i, n_i in enumerate(n_s):
    print(int(n_i))
    for j, n_j in enumerate(n_s):
        x, y, z = sample_data(int(n_i), int(n_j), z_shape)
        int_1[i, j] = min(timeit.repeat('interp_1(x, y, z)',
                                        'from __main__ import interp_1, x, y, z',
                                        repeat=10, number=1))
        int_2[i, j] = min(timeit.repeat('interp_2(x, y, z)',
                                        'from __main__ import interp_2, x, y, z',
                                        repeat=10, number=1))
cs = plt.contour(n_s, n_s, np.transpose(int_1 - int_2))
plt.clabel(cs, inline=1, fontsize=10)
plt.xlabel('n_i')
plt.ylabel('n_j')
plt.title('timeit(interp_2) - timeit(interp_1), z.shape=' + str(z_shape))
plt.show()
One optimization is to allocate the result array once like so:
import numpy as np
from scipy.interpolate import interp1d

z = np.asarray([200, 300, 400, 500, 600])
out = np.zeros([ni, len(z)], dtype=np.float32)
for i in range(ni):
    f = interp1d(x[i, :], y[i, :], kind='linear')
    out[i, :] = f(z)
This will save you some memory copying that occurs in your implementation in the calls to out.append(...).
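If linear interpolation is all you need, np.interp is a lighter-weight option for the same loop, since it avoids constructing an interp1d object per row; a sketch under the same assumptions about x, y, z and ni as above:
out = np.empty((ni, len(z)))
for i in range(ni):
    # np.interp(x_new, x_known, y_known) performs piecewise-linear interpolation
    out[i, :] = np.interp(z, x[i, :], y[i, :])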
