Using array indexing to apply a 2D array function on a 3D array - Python

I wrote a function that takes in one set of randomized Cartesian coordinates and returns the subset that remains within some spatial domain. To illustrate:
import numpy as np

grid = np.ones((5, 5))
grid = np.pad(grid, ((10, 10), (10, 10)), 'constant')
>> np.shape(grid)
(25, 25)
random_pts = np.random.random(size=(100, 2)) * len(grid)
def inside(input):
    # Floor the coordinates to integer grid indices
    idx = np.floor(input).astype(int)
    # Keep only the points that land on a grid cell equal to 1
    mask = grid[idx[:, 0], idx[:, 1]] == 1
    return input[mask]
>> inside(random_pts)
array([[ 10.59441506, 11.37998288],
[ 10.39124766, 13.27615815],
[ 12.28225713, 10.6970708 ],
[ 13.78351949, 12.9933591 ]])
But now I want the ability to simultaneously generate n sets of random_pts and keep n corresponding subsets that satisfy the same functional condition. So, if n=3,
random_pts = np.random.random(size=(3, 100, 2)) * len(grid)
Without resorting to a for loop, how can I index my variables so that inside(random_pts) returns something like
array([[[ 17.73323523, 9.81956681],
[ 10.97074592, 2.19671642],
[ 21.12081044, 12.80412997]],
[[ 11.41995519, 2.60974757]],
[[ 9.89827156, 9.74580059],
[ 17.35840479, 7.76972241]]])

One approach -
def inside3d(input):
    # Get idx in 3D
    idx3d = np.floor(input).astype(int)
    # Create a similar mask as with the 2D case, but in 3D now
    mask3d = grid[idx3d[:, :, 0], idx3d[:, :, 1]] == 1
    # Count of mask matches for each index in 0th dim
    counts = np.sum(mask3d, axis=1)
    # Index into input to get masked matches across all elements in 0th dim
    out_cat_array = input.reshape(-1, 2)[mask3d.ravel()]
    # Split the rows based on the counts, as the final output
    return np.split(out_cat_array, counts.cumsum()[:-1])
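To see why splitting at counts.cumsum()[:-1] recovers the per-set grouping, here is a minimal standalone sketch; the counts and rows are made up for illustration, not taken from the question:

import numpy as np

# Suppose three point sets kept 2, 1 and 3 rows respectively, and the
# surviving rows sit concatenated in one flat (6, 2) array.
counts = np.array([2, 1, 3])
flat = np.arange(12).reshape(6, 2)

# cumsum()[:-1] gives the split offsets [2, 3], i.e. where each
# subsequent group begins in the concatenated array.
groups = np.split(flat, counts.cumsum()[:-1])
print([g.shape for g in groups])  # [(2, 2), (1, 2), (3, 2)]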
Verify results -
Create 3D random input:
In [91]: random_pts3d = np.random.random(size=(3, 100, 2)) * len(grid)
With inside3d:
In [92]: inside3d(random_pts3d)
Out[92]:
[array([[ 10.71196268, 12.9875877 ],
[ 10.29700184, 10.00506662],
[ 13.80111411, 14.80514828],
[ 12.55070282, 14.63155383]]), array([[ 10.42636137, 12.45736944],
[ 11.26682474, 13.01632751],
[ 13.23550598, 10.99431284],
[ 14.86871413, 14.19079225],
[ 10.61103434, 14.95970597]]), array([[ 13.67395756, 10.17229061],
[ 10.01518846, 14.95480515],
[ 12.18167251, 12.62880968],
[ 11.27861513, 14.45609646],
[ 10.895685 , 13.35214678],
[ 13.42690335, 13.67224414]])]
With inside:
In [93]: inside(random_pts3d[0])
Out[93]:
array([[ 10.71196268, 12.9875877 ],
[ 10.29700184, 10.00506662],
[ 13.80111411, 14.80514828],
[ 12.55070282, 14.63155383]])
In [94]: inside(random_pts3d[1])
Out[94]:
array([[ 10.42636137, 12.45736944],
[ 11.26682474, 13.01632751],
[ 13.23550598, 10.99431284],
[ 14.86871413, 14.19079225],
[ 10.61103434, 14.95970597]])
In [95]: inside(random_pts3d[2])
Out[95]:
array([[ 13.67395756, 10.17229061],
[ 10.01518846, 14.95480515],
[ 12.18167251, 12.62880968],
[ 11.27861513, 14.45609646],
[ 10.895685 , 13.35214678],
[ 13.42690335, 13.67224414]])

Related

How to speed up looping arrays as inputs for pandas calculation?

I have two arrays named x and y. The goal is to iterate over them as the inputs for a pandas calculation.
Here's an example.
Iterating over each x and y and appending the calculation result to the res list is slow.
The calculation takes the exponential of each column scaled by a, sums the result, and multiplies it by b. In any case, this particular calculation could be replaced by any other one.
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 1)), columns=['data'])
x = np.linspace(1, 24, 4)
y = np.linspace(10, 1500, 5)
res = []
for a in x:
    for b in y:
        res.append(np.exp(-df/a).sum().values[0]*b)
res = np.array(res).reshape(4, 5)
expected output:
array([[ 11.67676844, 446.63639283, 881.59601721, 1316.5556416 ,
1751.51526599],
[ 37.52524129, 1435.34047927, 2833.15571725, 4230.97095523,
5628.78619321],
[ 42.79406912, 1636.87314392, 3230.95221871, 4825.0312935 ,
6419.1103683 ],
[ 44.93972433, 1718.94445549, 3392.94918665, 5066.95391781,
6740.95864897]])
You can use numpy broadcasting. For comparison, the loop result:
res = np.array(res).reshape(4, 5)
print(res)
[[ 11.67676844 446.63639283 881.59601721 1316.5556416 1751.51526599]
[ 37.52524129 1435.34047927 2833.15571725 4230.97095523 5628.78619321]
[ 42.79406912 1636.87314392 3230.95221871 4825.0312935 6419.1103683 ]
[ 44.93972433 1718.94445549 3392.94918665 5066.95391781 6740.95864897]]
The same computation with broadcasting, without the Python loops:
res = np.exp(-df.to_numpy()/x).sum(axis=0)[:, None] * y
print(res)
[[ 11.67676844 446.63639283 881.59601721 1316.5556416 1751.51526599]
[ 37.52524129 1435.34047927 2833.15571725 4230.97095523 5628.78619321]
[ 42.79406912 1636.87314392 3230.95221871 4825.0312935 6419.1103683 ]
[ 44.93972433 1718.94445549 3392.94918665 5066.95391781 6740.95864897]]
I think what you want is:
z = -df['data'].to_numpy()
res = np.exp(z/x[:, None]).sum(axis=1)[:, None]*y
output:
array([[ 11.67676844, 446.63639283, 881.59601721, 1316.5556416 ,
1751.51526599],
[ 37.52524129, 1435.34047927, 2833.15571725, 4230.97095523,
5628.78619321],
[ 42.79406912, 1636.87314392, 3230.95221871, 4825.0312935 ,
6419.1103683 ],
[ 44.93972433, 1718.94445549, 3392.94918665, 5066.95391781,
6740.95864897]])
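As a quick sanity check, here is a minimal self-contained sketch (same seed and data as the question) confirming that both broadcast versions reproduce the loop result:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 1)), columns=['data'])
x = np.linspace(1, 24, 4)
y = np.linspace(10, 1500, 5)

# Reference result from the double loop
res_loop = np.array([np.exp(-df/a).sum().values[0]*b
                     for a in x for b in y]).reshape(4, 5)

# The two broadcast versions from the answers above
res_bc1 = np.exp(-df.to_numpy()/x).sum(axis=0)[:, None] * y
res_bc2 = np.exp(-df['data'].to_numpy()/x[:, None]).sum(axis=1)[:, None] * y

print(np.allclose(res_loop, res_bc1), np.allclose(res_loop, res_bc2))  # True True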

Numpyic way to sort a matrix based on another similar matrix

Say I have a matrix Y of random float numbers from 0 to 10 with shape (10, 3):
import numpy as np
np.random.seed(99)
Y = np.random.uniform(0, 10, (10, 3))
print(Y)
Output:
[[6.72278559 4.88078399 8.25495174]
[0.31446388 8.08049963 5.6561742 ]
[2.97622499 0.46695721 9.90627399]
[0.06825733 7.69793028 7.46767101]
[3.77438936 4.94147452 9.28948392]
[3.95454044 9.73956297 5.24414715]
[0.93613093 8.13308413 2.11686786]
[5.54345785 2.92269116 8.1614236 ]
[8.28042566 2.21577372 6.44834702]
[0.95181622 4.11663239 0.96865261]]
I am now given a matrix X with the same shape, which can be seen as obtained by adding small noise to Y and then shuffling the rows:
X = np.random.normal(Y, scale=0.1)
np.random.shuffle(X)
print(X)
Output:
[[ 4.04067271 9.90959141 5.19126867]
[ 5.59873104 2.84109306 8.11175891]
[ 0.10743952 7.74620162 7.51100441]
[ 3.60396019 4.91708372 9.07551354]
[ 0.9400948 4.15448712 1.04187208]
[ 2.91884302 0.47222752 10.12700505]
[ 0.30995155 8.09263241 5.74876947]
[ 1.11247872 8.02092335 1.99767444]
[ 6.68543696 4.8345869 8.17330513]
[ 8.38904822 2.11830619 6.42013343]]
Now I want to sort the rows of X to match the row order of Y. I already know that, in each matching pair of rows, corresponding values differ by no more than a tolerance of 0.5. I managed to write the following code, and it works fine.
def sort_X_by_Y(X, Y, tol):
    # For each row of Y, find the first row of X within tol elementwise
    idxs = [next(i for i in range(len(X)) if all(abs(X[i] - row) <= tol)) for row in Y]
    return X[idxs]
print(sort_X_by_Y(X, Y, tol=0.5))
Output:
[[ 6.68543696 4.8345869 8.17330513]
[ 0.30995155 8.09263241 5.74876947]
[ 2.91884302 0.47222752 10.12700505]
[ 0.10743952 7.74620162 7.51100441]
[ 3.60396019 4.91708372 9.07551354]
[ 4.04067271 9.90959141 5.19126867]
[ 1.11247872 8.02092335 1.99767444]
[ 5.59873104 2.84109306 8.11175891]
[ 8.38904822 2.11830619 6.42013343]
[ 0.9400948 4.15448712 1.04187208]]
However, in reality I am sorting (1000, 3) matrices and my code is way too slow. I feel like there should be a more numpyic way to code this. Any suggestions?
This is a vectorized version of your algorithm. It runs ~26.5x faster than your implementation for 1000 samples, but it allocates an additional boolean array of shape (1000, 1000, 3). Note that if two rows of X happen to lie within the tolerance of the same target row, the wrong one may be selected.
tol = .5
X[(np.abs(Y[:, np.newaxis] - X) <= tol).all(2).argmax(1)]
Output
array([[ 6.68543696, 4.8345869 , 8.17330513],
[ 0.30995155, 8.09263241, 5.74876947],
[ 2.91884302, 0.47222752, 10.12700505],
[ 0.10743952, 7.74620162, 7.51100441],
[ 3.60396019, 4.91708372, 9.07551354],
[ 4.04067271, 9.90959141, 5.19126867],
[ 1.11247872, 8.02092335, 1.99767444],
[ 5.59873104, 2.84109306, 8.11175891],
[ 8.38904822, 2.11830619, 6.42013343],
[ 0.9400948 , 4.15448712, 1.04187208]])
More robust solutions use the L1 norm:
X[np.abs(Y[:, np.newaxis] - X).sum(2).argmin(1)]
or the L2 norm:
X[((Y[:, np.newaxis] - X)**2).sum(2).argmin(1)]
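As a quick check on the question's own data (seed 99, tolerance 0.5), a minimal sketch confirming all three vectorized variants reproduce sort_X_by_Y:

import numpy as np

np.random.seed(99)
Y = np.random.uniform(0, 10, (10, 3))
X = np.random.normal(Y, scale=0.1)
np.random.shuffle(X)

tol = 0.5
expected = X[[next(i for i in range(len(X))
              if all(abs(X[i] - row) <= tol)) for row in Y]]

v1 = X[(np.abs(Y[:, np.newaxis] - X) <= tol).all(2).argmax(1)]  # tolerance mask
v2 = X[np.abs(Y[:, np.newaxis] - X).sum(2).argmin(1)]           # L1 norm
v3 = X[((Y[:, np.newaxis] - X)**2).sum(2).argmin(1)]            # L2 norm

print(all(np.array_equal(expected, v) for v in (v1, v2, v3)))   # True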

How to split a 3D matrix into 2D matrices lined up in a list?

I have a NumPy array with the following shape:
(1532, 2036, 5)
I would like to generate a list of arrays where each one has the following shape:
(1532, 2036)
You can use Ellipsis to signify all dimensions up to the last. For example:
arr = np.random.rand(4, 3, 2)
arr
array([[[ 0.35235813, 0.57984153],
[ 0.53743048, 0.46753367],
[ 0.80048303, 0.07982378]],
[[ 0.1339381 , 0.84586721],
[ 0.81425027, 0.41086151],
[ 0.34039991, 0.19972737]],
[[ 0.2112466 , 0.73086434],
[ 0.03755819, 0.40113463],
[ 0.74622891, 0.74695994]],
[[ 0.99313615, 0.65634951],
[ 0.90787642, 0.37387861],
[ 0.8738962 , 0.41747727]]])
The list of the last dimension arrays can be constructed as #Usernamenotfound mentioned or with Ellipsis like so:
[arr[..., i] for i in range(arr.shape[-1])]
[array([[ 0.35235813, 0.53743048, 0.80048303],
[ 0.1339381 , 0.81425027, 0.34039991],
[ 0.2112466 , 0.03755819, 0.74622891],
[ 0.99313615, 0.90787642, 0.8738962 ]]),
array([[ 0.57984153, 0.46753367, 0.07982378],
[ 0.84586721, 0.41086151, 0.19972737],
[ 0.73086434, 0.40113463, 0.74695994],
[ 0.65634951, 0.37387861, 0.41747727]])]
Each element has the shape (4, 3).
Likewise, you could do the same for the first dimension, making four (3, 2) arrays.
[arr[i, ...] for i in range(arr.shape[0])]
[array([[ 0.35235813, 0.57984153],
[ 0.53743048, 0.46753367],
[ 0.80048303, 0.07982378]]), array([[ 0.1339381 , 0.84586721],
[ 0.81425027, 0.41086151],
[ 0.34039991, 0.19972737]]), array([[ 0.2112466 , 0.73086434],
[ 0.03755819, 0.40113463],
[ 0.74622891, 0.74695994]]), array([[ 0.99313615, 0.65634951],
[ 0.90787642, 0.37387861],
[ 0.8738962 , 0.41747727]])]
You can also permute the axes with numpy.transpose and then simply iterate over the array:
import numpy as np

arr = ...  # define the input array here
out = [a for a in np.transpose(arr, (2, 0, 1))]
You can slice the 3D array using
[x[:,:,i] for i in range(5)]
The above would give you a list of 2D arrays.
The same process can be scaled to arrays with more dimensions.
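For completeness, a minimal sketch (on a small stand-in array) of two more one-liners that yield the same list of 2D slices, using np.moveaxis and np.dsplit:

import numpy as np

x = np.random.rand(4, 3, 5)  # stand-in for the (1532, 2036, 5) array

slices_a = list(np.moveaxis(x, -1, 0))                            # five (4, 3) arrays
slices_b = [s.squeeze(axis=2) for s in np.dsplit(x, x.shape[2])]  # same slices

print(all(np.array_equal(a, b) for a, b in zip(slices_a, slices_b)))  # True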

What is the fastest way to prepare data for RNN with numpy?

I currently have a (1631160, 78) NumPy array as the input to my neural network. I would like to try something with LSTM, which requires a 3D structure as input data. I'm currently using the following code to generate the 3D structure needed, but it is super slow (ETA > 1 day). Is there a better way to do this with numpy?
My current code to generate data:
import datetime
import sys
import time

import numpy as np

def transform_for_rnn(input_x, input_y, window_size):
    output_x = None
    start_t = time.time()
    for i in range(len(input_x)):
        if i > 100 and i % 100 == 0:
            eta = datetime.timedelta(seconds=(time.time()-start_t)/i * (len(input_x) - i))
            sys.stdout.write('\rTransform Data: %d/%d\tETA:%s' % (i, len(input_x), str(eta)))
            sys.stdout.flush()
        # Append the current window as a new leading axis, growing output_x
        if output_x is None:
            output_x = np.array([input_x[i:i+window_size, :]])
        else:
            tmp = np.array([input_x[i:i+window_size, :]])
            output_x = np.concatenate((output_x, tmp))
    print()
    output_y = input_y[window_size:]
    assert len(output_x) == len(output_y)
    return output_x, output_y
Here's an approach using NumPy strides to vectorize the creation of output_x -
nrows = input_x.shape[0] - window_size + 1
p, q = input_x.shape
m, n = input_x.strides
strided = np.lib.stride_tricks.as_strided
out = strided(input_x, shape=(nrows, window_size, q), strides=(m, m, n))
Sample run -
In [83]: input_x
Out[83]:
array([[ 0.73089384, 0.98555845, 0.59818726],
[ 0.08763718, 0.30853945, 0.77390923],
[ 0.88835985, 0.90506367, 0.06204614],
[ 0.21791334, 0.77523643, 0.47313278],
[ 0.93324799, 0.61507976, 0.40587073],
[ 0.49462016, 0.00400835, 0.66401908]])
In [84]: window_size = 4
In [85]: out
Out[85]:
array([[[ 0.73089384, 0.98555845, 0.59818726],
[ 0.08763718, 0.30853945, 0.77390923],
[ 0.88835985, 0.90506367, 0.06204614],
[ 0.21791334, 0.77523643, 0.47313278]],
[[ 0.08763718, 0.30853945, 0.77390923],
[ 0.88835985, 0.90506367, 0.06204614],
[ 0.21791334, 0.77523643, 0.47313278],
[ 0.93324799, 0.61507976, 0.40587073]],
[[ 0.88835985, 0.90506367, 0.06204614],
[ 0.21791334, 0.77523643, 0.47313278],
[ 0.93324799, 0.61507976, 0.40587073],
[ 0.49462016, 0.00400835, 0.66401908]]])
This creates a view into the input array, so it is memory-efficient, and in most cases that should translate into performance benefits for further operations involving it. Let's verify that it's indeed a view -
In [86]: np.may_share_memory(out,input_x)
Out[86]: True # Doesn't guarantee, but is sufficient in most cases
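If you want the exact (worst-case slower) check rather than the heuristic, np.shares_memory resolves the actual overlap:

np.shares_memory(out, input_x)  # -> True, and this one is a guarantee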
Another sure-shot way to verify would be to set some values into output and check the input -
In [87]: out[0] = 0
In [88]: input_x
Out[88]:
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0.93324799, 0.61507976, 0.40587073],
[ 0.49462016, 0.00400835, 0.66401908]])
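For what it's worth, on NumPy 1.20+ the same windowed view can be built without manual stride arithmetic via sliding_window_view; a minimal sketch (the window axis comes out last, hence the transpose):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

input_x = np.random.rand(6, 3)
window_size = 4

# Window along axis 0; result is (nrows, q, window_size), so reorder axes
out = sliding_window_view(input_x, window_size, axis=0).transpose(0, 2, 1)
print(out.shape)  # (3, 4, 3) == (nrows, window_size, q)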

Array-Based Numpy 3d Array Assignment

Take a 2D numpy.array, let's say:
mat = numpy.random.rand(3,3)
In [153]: mat
Out[153]:
array([[ 0.16716156, 0.90822617, 0.83888038],
[ 0.89771815, 0.62627978, 0.34992542],
[ 0.11097042, 0.80858005, 0.0437299 ]])
Changing the diagonal entries to numpy.nan is quite straightforward.
Either of the following works great:
In [154]: diag = numpy.diag_indices(mat.shape[0], ndim = 2)
In [155]: mat[diag] = numpy.nan
or
In [156]: numpy.fill_diagonal(mat, numpy.nan)
But let's say I have a 3D array, where I want the exact same process applied to every 2D matrix stacked along the first dimension.
mat = numpy.random.rand(3, 5, 5)
In [158]: mat
Out[158]:
array([[[ 0.65000325, 0.71059547, 0.31880388, 0.24818623, 0.57722849],
[ 0.26908326, 0.41962004, 0.78642476, 0.25711662, 0.8662998 ],
[ 0.15332566, 0.12633147, 0.54032977, 0.17322095, 0.17210078],
[ 0.81952873, 0.20751669, 0.73514815, 0.00884358, 0.89222687],
[ 0.62775839, 0.53657471, 0.99611842, 0.75051645, 0.59328044]],
[[ 0.28718216, 0.84982865, 0.27830082, 0.90604492, 0.43119512],
[ 0.43039373, 0.76557782, 0.58089787, 0.81135684, 0.39151152],
[ 0.70592711, 0.30625204, 0.9753166 , 0.32806864, 0.21947731],
[ 0.74600317, 0.33711673, 0.16203076, 0.6002213 , 0.74996638],
[ 0.63555715, 0.71719058, 0.81420001, 0.28968442, 0.01368163]],
[[ 0.06474027, 0.51966572, 0.006429 , 0.98590784, 0.35708074],
[ 0.44977222, 0.63719921, 0.88325451, 0.53820139, 0.51526687],
[ 0.98529117, 0.46219441, 0.09349748, 0.11406291, 0.47697128],
[ 0.77446136, 0.87423445, 0.71810465, 0.39019846, 0.94070077],
[ 0.09154989, 0.36295161, 0.19740833, 0.17803146, 0.6498038 ]]])
A logical way to do that (I would think) is:
mat[:, diag] = numpy.nan # doesn't do it
In fact, to accomplish this, I need to:
In [190]: rng = numpy.arange(5)
In [191]: for i in numpy.arange(mat.shape[0]):
     ...:     mat[i, rng, rng] = numpy.nan
     ...:
In [192]: mat
Out[192]:
array([[[ nan, 0.4040426 , 0.89449522, 0.63593736, 0.94922036],
[ 0.40682651, nan, 0.30812181, 0.01726625, 0.75655994],
[ 0.23925763, 0.41476223, nan, 0.91590111, 0.18391644],
[ 0.99784977, 0.71636554, 0.21252766, nan, 0.24195636],
[ 0.41137357, 0.84705055, 0.60086461, 0.16403918, nan]],
[[ nan, 0.26183712, 0.77621913, 0.5479058 , 0.17142263],
[ 0.17969373, nan, 0.89742863, 0.65698339, 0.95817106],
[ 0.79048886, 0.16365168, nan, 0.97394435, 0.80612441],
[ 0.94169129, 0.10895737, 0.92614597, nan, 0.08689534],
[ 0.20324943, 0.91402716, 0.23112819, 0.2556875 , nan]],
[[ nan, 0.43177039, 0.76901587, 0.82069345, 0.64351534],
[ 0.14148584, nan, 0.35820379, 0.17434688, 0.78884305],
[ 0.85232784, 0.93526843, nan, 0.80981366, 0.57326785],
[ 0.82104636, 0.63453196, 0.5872653 , nan, 0.96214559],
[ 0.69959383, 0.70257404, 0.92471502, 0.50077728, nan]]])
It's for an application where speed is of the utmost importance, so if there isn't an array-based implementation, I'm going to do the for-loop / assignment in Cython.
This seems to work:
diag = numpy.diag_indices(mat.shape[1], ndim = 2)
mat[:, diag[0], diag[1]] = numpy.nan
The problem is that diag is a 2-element tuple, so using it as-is in a 3D index won't work, and using *diag is unfortunately invalid syntax. However, you can also do this:
diag = (Ellipsis, *numpy.diag_indices(mat.shape[-1], ndim = 2))
mat[diag] = numpy.nan
In this case, diag is the three-element tuple you need to use it as an index. Ellipsis is the object that represents : repeated as many times as necessary in the index. This version will work for any number of dimensions >2 where the last two represent the square matrices you want.
Using linear indexing -
m, n, r = mat.shape
mat.reshape(m, -1)[:, np.arange(r)*(r+1)] = np.nan
Using slicing and boolean indexing -
m, n, r = mat.shape
mat.reshape(m, -1)[:, np.eye(n, r, dtype=bool).ravel()] = np.nan
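One more option, assuming NumPy 1.10+ where einsum returns a writable view of the stacked diagonals, is a minimal sketch like:

import numpy as np

mat = np.random.rand(3, 5, 5)

# '...ii->...i' pulls out the diagonal of each trailing 2D matrix as a
# writable view, so assigning through it mutates mat in place
np.einsum('...ii->...i', mat)[...] = np.nan
print(np.isnan(mat).sum())  # 15: one 5-element diagonal per 5x5 slice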
