Efficiently select subsection of numpy array - python

I want to split a numpy array into three different arrays based on a logical comparison. The numpy array I want to split is called x. Its shape looks as follows, but its entries vary: (In response to Saullo Castro's comment I included a slightly different array x.)
array([[ 0.46006547,  0.5580928 ,  0.70164242,  0.84519205,  1.4 ],
       [ 0.00912908,  0.00912908,  0.05      ,  0.05      ,  0.05]])
The values of this array increase monotonically from column to column. I also have two other arrays called lowest_gridpoints and highest_gridpoints. The entries of these arrays also vary, but the shape is always identical to the following:
array([ 0.633, 0.01 ]), array([ 1.325, 0.99 ])
The selection procedure I want to apply is as follows:
All columns containing values lower than any value in lowest_gridpoints should be removed from x and constitute the array temp1.
All columns containing values higher than any value in highest_gridpoints should be removed from x and constitute the array temp2.
All columns of x that are included in neither temp1 nor temp2 constitute the array x_new.
The following code I wrote achieves the task:
if np.any(x[:, -1] > highest_gridpoints) or np.any(x[:, 0] < lowest_gridpoints):
    for idx, sample in enumerate(x.T):
        if np.any(sample > highest_gridpoints):
            max_idx = idx
            break
        elif np.any(sample < lowest_gridpoints):
            min_idx = idx

temp1, temp2 = np.array([[], []]), np.array([[], []])

if 'min_idx' in locals():
    temp1 = x[:, 0:min_idx + 1]
if 'max_idx' in locals():
    temp2 = x[:, max_idx:]

if 'min_idx' in locals() or 'max_idx' in locals():
    if 'min_idx' not in locals():
        min_idx = -1
    if 'max_idx' not in locals():
        max_idx = x.shape[1]
    x_new = x[:, min_idx + 1:max_idx]
However, I suspect that this code is very inefficient because of the heavy use of loops, and the syntax feels bloated.
Does anyone have an idea for code that achieves the task outlined above more efficiently, or at least more concisely?

This addresses only the first part of your question:
from numpy import *

x = array([[ 0.46006547, 0.5580928 , 0.70164242, 0.84519205, 1.4 ],
           [ 0.00912908, 0.00912908, 0.05      , 0.05      , 0.05]])
low, high = array([0.633, 0.01]), array([1.325, 0.99])

# construct an array of two rows of bools expressing your conditions
indices1 = array((x[0, :] < low[0], x[1, :] < low[1]))
print(indices1)

# do an "or" of the values along the first axis
indices1 = any(indices1, axis=0)
# now it's a single-row array
print(indices1)

# use indices1 to extract what you want; the double transposition
# is needed because the elements of a 2D array are the rows
tmp1 = x.T[indices1].T
print(tmp1)
# [[ True  True False False False]
#  [ True  True False False False]]
# [ True  True False False False]
# [[ 0.46006547  0.5580928 ]
#  [ 0.00912908  0.00912908]]
Next, construct indices2 and tmp2 in the same way; the indices of the remainder are the negation of the OR of the first two index arrays, i.e. numpy.logical_not(numpy.logical_or(indices1, indices2)).
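Putting it together, a minimal sketch of the full three-way split (assuming x, low and high as defined above):

import numpy as np

indices1 = np.any((x[0, :] < low[0], x[1, :] < low[1]), axis=0)    # columns below the lower grid points
indices2 = np.any((x[0, :] > high[0], x[1, :] > high[1]), axis=0)  # columns above the upper grid points
keep = np.logical_not(np.logical_or(indices1, indices2))

temp1 = x[:, indices1]   # boolean indexing along axis 1 replaces the double transposition
temp2 = x[:, indices2]
x_new = x[:, keep]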
Addendum
Another approach, possibly faster if you have thousands of entries, uses numpy.searchsorted:
from numpy import *

x = array([[ 0.46006547, 0.5580928 , 0.70164242, 0.84519205, 1.4 ],
           [ 0.00912908, 0.00912908, 0.05      , 0.05      , 0.05]])
low, high = array([0.633, 0.01]), array([1.325, 0.99])

l0r = searchsorted(x[0, :], low[0], side='right')
l1r = searchsorted(x[1, :], low[1], side='right')
h0l = searchsorted(x[0, :], high[0], side='left')
h1l = searchsorted(x[1, :], high[1], side='left')
lr = max(l0r, l1r)
hl = min(h0l, h1l)

print(lr, hl)
print(x[:, :lr])
print(x[:, lr:hl])
print(x[:, hl])
# 2 4
# [[ 0.46006547  0.5580928 ]
#  [ 0.00912908  0.00912908]]
# [[ 0.70164242  0.84519205]
#  [ 0.05        0.05      ]]
# [ 1.4   0.05]
Overlaps can be excluded with hl = max(lr, hl). NB: in the previous approach the array slices are copied to new objects; here you get views on x, and you have to copy explicitly if you want new objects.
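For example, a minimal sketch that forces copies of the view-based slices (assuming x, lr and hl as above):

temp1 = x[:, :lr].copy()     # .copy() detaches the slice from x
x_new = x[:, lr:hl].copy()
temp2 = x[:, hl:].copy()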
Edit An unnecessary optimization
If we use only the upper part of x in the second pair of searchsorted calls (if you look at the code you'll see what I mean...) we get two benefits: 1) a very small speedup of the searches (searchsorted is always fast enough) and 2) the case of overlap is handled automatically.
As a bonus, here is code for copying the segments of x into the new arrays. NB: x was changed to force an overlap.
from numpy import *

# I changed x to force an overlap
x = array([[ 0.46006547, 1.4       , 1.4 , 1.4 , 1.4 ],
           [ 0.00912908, 0.00912908, 0.05, 0.05, 0.05]])
low, high = array([0.633, 0.01]), array([1.325, 0.99])

l0r = searchsorted(x[0, :], low[0], side='right')
l1r = searchsorted(x[1, :], low[1], side='right')
lr = max(l0r, l1r)
# search only the part of x above lr: overlap is handled automatically
h0l = searchsorted(x[0, lr:], high[0], side='left')
h1l = searchsorted(x[1, lr:], high[1], side='left')
hl = min(h0l, h1l) + lr

# fancy indexing with range() copies the segments instead of creating views
t1 = x[:, range(lr)]
xn = x[:, range(lr, hl)]
ncol = shape(x)[1]
t2 = x[:, range(hl, ncol)]

print(x)
del x
print()
print(t1)
print()
# note that xn is an empty array
print(xn)
print()
print(t2)
# [[ 0.46006547  1.4         1.4   1.4   1.4 ]
#  [ 0.00912908  0.00912908  0.05  0.05  0.05]]
#
# [[ 0.46006547  1.4       ]
#  [ 0.00912908  0.00912908]]
#
# []
#
# [[ 1.4   1.4   1.4 ]
#  [ 0.05  0.05  0.05]]

Related

Numpyic way to sort a matrix based on another similar matrix

Say I have a matrix Y of random float numbers from 0 to 10 with shape (10, 3):
import numpy as np
np.random.seed(99)
Y = np.random.uniform(0, 10, (10, 3))
print(Y)
Output:
[[6.72278559 4.88078399 8.25495174]
[0.31446388 8.08049963 5.6561742 ]
[2.97622499 0.46695721 9.90627399]
[0.06825733 7.69793028 7.46767101]
[3.77438936 4.94147452 9.28948392]
[3.95454044 9.73956297 5.24414715]
[0.93613093 8.13308413 2.11686786]
[5.54345785 2.92269116 8.1614236 ]
[8.28042566 2.21577372 6.44834702]
[0.95181622 4.11663239 0.96865261]]
I am now given a matrix X with same shape that can be seen as obtained by adding small noises to Y and then shuffling the rows:
X = np.random.normal(Y, scale=0.1)
np.random.shuffle(X)
print(X)
Output:
[[ 4.04067271 9.90959141 5.19126867]
[ 5.59873104 2.84109306 8.11175891]
[ 0.10743952 7.74620162 7.51100441]
[ 3.60396019 4.91708372 9.07551354]
[ 0.9400948 4.15448712 1.04187208]
[ 2.91884302 0.47222752 10.12700505]
[ 0.30995155 8.09263241 5.74876947]
[ 1.11247872 8.02092335 1.99767444]
[ 6.68543696 4.8345869 8.17330513]
[ 8.38904822 2.11830619 6.42013343]]
Now I want to sort the matrix X based on Y by row. I already know that, for each matching pair of rows, no pair of column values differs by more than a tolerance of 0.5. I managed to write the following code, and it works fine.
def sort_X_by_Y(X, Y, tol):
    idxs = [next(i for i in range(len(X)) if all(abs(X[i] - row) <= tol)) for row in Y]
    return X[idxs]
print(sort_X_by_Y(X, Y, tol=0.5))
Output:
[[ 6.68543696 4.8345869 8.17330513]
[ 0.30995155 8.09263241 5.74876947]
[ 2.91884302 0.47222752 10.12700505]
[ 0.10743952 7.74620162 7.51100441]
[ 3.60396019 4.91708372 9.07551354]
[ 4.04067271 9.90959141 5.19126867]
[ 1.11247872 8.02092335 1.99767444]
[ 5.59873104 2.84109306 8.11175891]
[ 8.38904822 2.11830619 6.42013343]
[ 0.9400948 4.15448712 1.04187208]]
However, in reality I am sorting (1000, 3) matrices and my code is way too slow. I feel like there should be a more numpyic way to code this. Any suggestions?
This is a vectorized version of your algorithm. It runs ~26.5x faster than your implementation for 1000 samples, but it creates an additional boolean array of shape (1000, 1000, 3). There is also a chance that two rows have similar values within the tolerance, in which case a wrong row is selected.
tol = .5
X[(np.abs(Y[:, np.newaxis] - X) <= tol).all(2).argmax(1)]
Output
array([[ 6.68543696, 4.8345869 , 8.17330513],
[ 0.30995155, 8.09263241, 5.74876947],
[ 2.91884302, 0.47222752, 10.12700505],
[ 0.10743952, 7.74620162, 7.51100441],
[ 3.60396019, 4.91708372, 9.07551354],
[ 4.04067271, 9.90959141, 5.19126867],
[ 1.11247872, 8.02092335, 1.99767444],
[ 5.59873104, 2.84109306, 8.11175891],
[ 8.38904822, 2.11830619, 6.42013343],
[ 0.9400948 , 4.15448712, 1.04187208]])
More robust solutions use the L1 norm:
X[np.abs(Y[:, np.newaxis] - X).sum(2).argmin(1)]
Or the L2 norm:
X[((Y[:, np.newaxis] - X)**2).sum(2).argmin(1)]
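The L2 variant is effectively a nearest-neighbour match, so an equivalent sketch using scipy (assuming scipy is available) would be:

from scipy.spatial.distance import cdist

# match each row of Y to its nearest row of X in euclidean distance
X[cdist(Y, X).argmin(1)]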

Find all the intervals in numpy

I have an xy numpy array with shape (600, 2):
array([[0. , 0.01 ],
[0.02 , 0.03 ],
[0.04 , 0.05 ],
...,
[1.21943121, 1.14205236],
[1.07493206, 1.01916783],
[0.97570154, 0.94530397]])
I need to find all the intervals in which the values of the second dimension are less than zero, mark them as +, and print them with the index from the first dimension.
Output example:
[0.00 0.04] +
[0.04 0.08] -
[0.08 0.10] +
I would be very grateful if you could help me!
This keeps the rows whose second value is less than 0:
negatives = [i for i in arr if i[1] < 0]   # avoids shadowing the built-in filter()
If you need the indices, then:
ind = []
for i in range(len(arr)):
    if arr[i][1] < 0:
        ind.append(i)
In Numpy, it would be simply:
np.where(arr[:,1] < 0)
This will get you the indices.
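Since the question asks for intervals rather than isolated indices, here is a minimal sketch (assuming arr holds the data shown above) that groups the boolean mask into contiguous [start, stop) runs and prints the first-dimension values at the run boundaries:

import numpy as np

mask = arr[:, 1] < 0
# pad with False so runs touching either end still yield a start/stop pair
padded = np.concatenate(([False], mask, [False]))
flips = np.flatnonzero(np.diff(padded.astype(int)))
intervals = flips.reshape(-1, 2)              # each row is [start, stop) of a negative run
for start, stop in intervals:
    print(arr[start, 0], arr[stop - 1, 0], '+')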

tensorflow collect similar values from a list

I have a tensor as follows:
arr = [[1.5,0.2],[2.3,0.1],[1.3,0.21],[2.2,0.09],[4.4,0.8]]
I would like to collect small arrays whose difference of first elements are within 0.3 and second elements are within 0.03.
For example [1.5,0.2] and [1.3,0.21] should belong to a same category. The difference of their first elements is 0.2<0.3 and second 0.01<0.03.
I want a tensor that looks like this:
arr = {[[1.5,0.2],[1.3,0.21]],[[2.3,0.1],[2.2,0.09]]}
How to do this in tensorflow? Eager mode is ok.
I found a way which is a bit ugly and slow:
samples = np.array([[1.5, 0.2], [2.3, 0.1], [1.3, 0.2], [2.2, 0.09], [4.4, 0.8], [2.3, 0.11]],
                   dtype=np.float32)
ini_samples = samples
samples = tf.split(samples, 2, 1)
a = samples[0]
b = samples[1]
find_match1 = tf.reduce_sum(tf.abs(tf.expand_dims(a, 0) - tf.expand_dims(a, 1)), 2)
a = tf.logical_and(tf.greater(find_match1, tf.zeros_like(find_match1)),
                   tf.less(find_match1, 0.3 * tf.ones_like(find_match1)))
find_match2 = tf.reduce_sum(tf.abs(tf.expand_dims(b, 0) - tf.expand_dims(b, 1)), 2)
b = tf.logical_and(tf.greater(find_match2, tf.zeros_like(find_match2)),
                   tf.less(find_match2, 0.03 * tf.ones_like(find_match2)))
x, y = tf.unique(tf.reshape(tf.where(tf.logical_or(a, b)), [1, -1])[0])
r = tf.gather(ini_samples, x)
Does tensorflow have more elegant functions?
You cannot get a result composed of "groups" of vectors with different sizes. Instead, you can make a "group id" tensor that classifies each vector into a group according to your criteria. The part that makes this a bit more complicated is that you have to "fuse" groups with common elements, which I think can only be done with a loop. This code does something like that:
import tensorflow as tf

def make_groups(correspondences):
    # Multiply each row by its index
    m = tf.to_int32(correspondences) * tf.range(tf.shape(correspondences)[0])
    # Pick the largest index for each row
    r = tf.reduce_max(m, axis=1)
    # While loop accounts for transitive correspondences
    # (e.g. if A and B go together and B and C go together, then A, B and C go together)
    # The loop makes sure every element gets the largest common group id
    r_prev = -tf.ones_like(r)
    r, _ = tf.while_loop(lambda r, r_prev: tf.reduce_any(tf.not_equal(r, r_prev)),
                         lambda r, r_prev: (tf.gather(r, r), tf.identity(r)),
                         [r, r_prev])
    # Use unique indices to make sequential group ids starting from 0
    return tf.unique(r)[1]
# Test
with tf.Graph().as_default(), tf.Session() as sess:
    arr = tf.constant([[1.5 , 0.2 ],
                       [2.3 , 0.1 ],
                       [1.3 , 0.21],
                       [2.2 , 0.09],
                       [4.4 , 0.8 ],
                       [1.1 , 0.23]])
    a = arr[:, 0]
    b = arr[:, 1]
    cond = (tf.abs(a - a[:, tf.newaxis]) < 0.3) | (tf.abs(b - b[:, tf.newaxis]) < 0.03)
    groups = make_groups(cond)
    print(sess.run(groups))
    # [0 1 0 1 2 0]
So in this case, the groups would be:
[1.5, 0.2], [1.3, 0.21] and [1.1, 0.23]
[2.3, 0.1] and [2.2, 0.09]
[4.4, 0.8]
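If you also want the groups materialized as a list of tensors, a possible follow-up (a sketch that must run inside the same session, and that assumes the number of groups, here 3, is known statically) uses tf.dynamic_partition:

# split the rows of arr by group id; num_partitions must be a static Python int
parts = tf.dynamic_partition(arr, groups, num_partitions=3)
for part in sess.run(parts):
    print(part)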

How to efficiently apply functions to values in an array based on condition?

I have an array arorg like this:
import numpy as np
arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
and another array values that looks as follows:
values = np.array([1., 0., 2.])
values has the same number of entries as arorg has columns.
Now I want to apply functions to the entries of arorg depending on whether they are positive or negative:
def neg_fun(val1, val2):
    return val1 / (val1 + abs(val2))

def pos_fun(val1, val2):
    return 1. / ((val1 / val2) + 1.)
Here, val2 is the (absolute) value in arorg, and val1 (this is the tricky part) comes from values: if I apply pos_fun or neg_fun to column i of arorg, val1 should be values[i].
I currently implement that as follows:
ar = arorg.copy()
for (x, y) in zip(*np.where(ar > 0)):
    ar.itemset((x, y), pos_fun(values[y], ar.item(x, y)))
for (x, y) in zip(*np.where(ar < 0)):
    ar.itemset((x, y), neg_fun(values[y], ar.item(x, y)))
which gives me the desired output:
array([[ 0.5       ,  1.        ,  0.33333333],
       [ 0.33333333,  0.        ,  0.6       ]])
As I have to do these calculations very often, I am wondering whether there is a more efficient way of doing this. Something like
np.where(arorg > 0, pos_fun(xxx), arorg)
would be great, but I don't know how to pass the arguments correctly (the xxx). Any suggestions?
As hinted in the question, here's one using np.where.
First off, we are using a direct translation of the function implementation to generate values/arrays for both positive and negative cases. Then, with a mask of positive values, we will choose between those two arrays using np.where.
Thus, the implementation would look something along these lines -
# Get positive and negative values for all elements
val1 = values
val2 = arorg
neg_vals = val1 / (val1 + np.abs(val2))
pos_vals = 1. / ((val1 / val2) + 1.)
# Get a positive mask and choose between positive and negative values
pos_mask = arorg > 0
out = np.where(pos_mask, pos_vals, neg_vals)
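One caveat: both pos_vals and neg_vals are computed for every element, so the branch that is ultimately discarded can still emit divide or invalid-value warnings. A small sketch that suppresses them (assuming values and arorg as above):

import numpy as np

with np.errstate(divide='ignore', invalid='ignore'):
    neg_vals = values / (values + np.abs(arorg))
    pos_vals = 1. / ((values / arorg) + 1.)
out = np.where(arorg > 0, pos_vals, neg_vals)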
You don't need to apply a function to zipped elements of the arrays; you can accomplish the same thing through simple array operations and boolean indexing.
First, compute the positive and negative results, saved as arrays. Then create a result array of zeros (just as a default value) and populate it using boolean slices of pos and neg:
import numpy as np
arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
values = np.array([1., 0., 2.])
pos = 1. / ((values / arorg) + 1)
neg = values / (values + np.abs(arorg))
ret = np.zeros_like(arorg)
ret[arorg>0] = pos[arorg>0]
ret[arorg<=0] = neg[arorg<=0]
ret
# returns:
array([[ 0.5       ,  1.        ,  0.33333333],
       [ 0.33333333,  0.        ,  0.6       ]])
import numpy as np
arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
values = np.array([1., 0., 2.])
p = 1.0/(values/arorg+1)
n = values/(values+abs(arorg))
# use np.place to overwrite the negative positions of p with the corresponding values from n
np.place(p,arorg<0,n[arorg<0])
print(p)
[[ 0.5         1.          0.33333333]
 [ 0.33333333  0.          0.6       ]]

Assign numpy array of points to a 2D square grid

I'm going beyond my previous question because of speed problems. I have an array of Lat/Lon coordinates of points, and I would like to assign them to an index code derived from a 2D square grid of equally sized cells. This is an example of how it would work. Let's call points my first array, containing the coordinates (call them [x y] pairs) of six points:
points = [[ 1.5 1.5]
[ 1.1 1.1]
[ 2.2 2.2]
[ 1.3 1.3]
[ 3.4 1.4]
[ 2. 1.5]]
Then I have another array containing the coordinates of the vertices of a grid of two cells in the form [minx,miny,maxx,maxy]; let's call it bounds:
bounds = [[ 0. 0. 2. 2.]
[ 2. 2. 3. 3.]]
I would like to find which points are in which boundary, and then assign a code derived from the bounds array index (in this case the first cell has code 0, the second 1 and so on...). Since the cells are squares, the easiest way to compute if each point is in each cell is to evaluate:
x > minx & x < maxx & y > miny & y < maxy
So that the resulting array would appear as:
results = [0 0 1 0 NaN NaN]
where NaN means that the point is outside cells. The number of elements in my real case is of the order of finding 10^6 points into 10^4 cells. Is there a way to do this kind of things in a fast way using numpy arrays?
EDIT: to clarify, the expected results array means that the first point is inside the first cell (index 0 of the bounds array), so is the second; the third point is inside the second cell of the bounds array; and so on...
Here is a vectorized approach to your problem. It should speed things up significantly.
import numpy as np

def findCells(points, bounds):
    # make sure points is n by 2 (pool.map might send us 1D arrays)
    points = points.reshape((-1, 2))
    # check for each point if all coordinates are in bounds
    # dimension 0 is bound
    # dimension 1 is point
    allInBounds  = (points[:, 0] > bounds[:, None, 0])
    allInBounds &= (points[:, 1] > bounds[:, None, 1])
    allInBounds &= (points[:, 0] < bounds[:, None, 2])
    allInBounds &= (points[:, 1] < bounds[:, None, 3])
    # now find out the positions of all nonzero (i.e. true) values
    # nz[0] contains the indices along dim 0 (bound)
    # nz[1] contains the indices along dim 1 (point)
    nz = np.nonzero(allInBounds)
    # initialize the result with all nan
    r = np.full(points.shape[0], np.nan)
    # now use nz[1] to index point positions and nz[0] to tell which cell each
    # point belongs to
    r[nz[1]] = nz[0]
    return r

def findCellsParallel(points, bounds, chunksize=100):
    import multiprocessing as mp
    from functools import partial
    func = partial(findCells, bounds=bounds)
    # using python3 you could also do 'with mp.Pool() as p:'
    p = mp.Pool()
    try:
        return np.hstack(p.map(func, points, chunksize))
    finally:
        p.close()

def main():
    nPoints = int(1e6)
    nBounds = int(1e4)
    # points = np.array([[1.5, 1.5],
    #                    [1.1, 1.1],
    #                    [2.2, 2.2],
    #                    [1.3, 1.3],
    #                    [3.4, 1.4],
    #                    [2. , 1.5]])
    points = np.random.random([nPoints, 2])
    # bounds = np.array([[0, 0, 2, 2],
    #                    [2, 2, 3, 3]])
    # bounds = np.array([[0, 0, 1.4, 1.4],
    #                    [1.4, 1.4, 2, 2],
    #                    [2, 2, 3, 3]])
    bounds = np.sort(np.random.random([nBounds, 2, 2]), 1).reshape(nBounds, 4)
    r = findCellsParallel(points, bounds)
    print(points[:10])
    for bIdx in np.unique(r[:10]):
        if np.isnan(bIdx):
            continue
        print("{}: {}".format(bIdx, bounds[int(bIdx)]))
    print(r[:10])

if __name__ == "__main__":
    main()
Edit:
Trying it with your amount of data gave me a MemoryError. You can avoid that, and even speed things up a little more, if you use multiprocessing.Pool with its map function; see the updated code.
Result:
>time python test.py
[[ 0.69083585 0.19840985]
[ 0.31732711 0.80462512]
[ 0.30542996 0.08569184]
[ 0.72582609 0.46687164]
[ 0.50534322 0.35530554]
[ 0.93581095 0.36375539]
[ 0.66226118 0.62573407]
[ 0.08941219 0.05944215]
[ 0.43015872 0.95306899]
[ 0.43171644 0.74393729]]
9935.0: [ 0.31584562 0.18404152 0.98215445 0.83625487]
9963.0: [ 0.00526106 0.017255 0.33177741 0.9894455 ]
9989.0: [ 0.17328876 0.08181912 0.33170444 0.23493507]
9992.0: [ 0.34548987 0.15906761 0.92277442 0.9972481 ]
9993.0: [ 0.12448765 0.5404578 0.33981119 0.906822 ]
9996.0: [ 0.41198261 0.50958195 0.62843379 0.82677092]
9999.0: [ 0.437169 0.17833114 0.91096133 0.70713434]
[ 9999. 9993. 9989. 9999. 9999. 9935. 9999. 9963. 9992. 9996.]
real 0m 24.352s
user 3m 4.919s
sys 0m 1.464s
You can use a nested loop to check the condition and yield the results from a generator:
points = [[1.5, 1.5],
          [1.1, 1.1],
          [2.2, 2.2],
          [1.3, 1.3],
          [3.4, 1.4],
          [2. , 1.5]]
bounds = [[0., 0., 2., 2.],
          [2., 2., 3., 3.]]

def pos(p, b):
    for x, y in p:
        flag = False
        for index, dis in enumerate(b):
            minx, miny, maxx, maxy = dis
            if x > minx and x < maxx and y > miny and y < maxy:
                flag = True
                yield index
        if not flag:
            yield 'NaN'

print(list(pos(points, bounds)))
result :
[0, 0, 1, 0, 'NaN', 'NaN']
I would do it like this:
import numpy as np

points = np.random.rand(10, 2)
xmin = [0.25, 0.5]
ymin = [0.25, 0.5]
results = np.zeros(len(points))
for i in range(len(xmin)):
    bool_index_array = np.greater(points, [xmin[i], ymin[i]])
    print("boolean index of (x,y) greater than (xmin, ymin):", bool_index_array)
    indices_of_true_true = np.where(bool_index_array[:, 0] * bool_index_array[:, 1] == 1)[0]
    print("indices of [True, True]:", indices_of_true_true)
    results[indices_of_true_true] += 1
print("results:", results)
# [out]: [ 1.  1.  1.  2.  0.  0.  1.  1.  1.  1.]
This uses the lower boundaries to categorize your points into the groups:
1 (if xmin[0] < x <= xmin[1] & ymin[0] < y <= ymin[1])
2 (if x > xmin[1] & y > ymin[1])
0 (if none of the conditions above are fulfilled)
