Find row indices from two 2d arrays with close values - python

I have an array a
a = np.array([[4, 4],
              [5, 4],
              [6, 4],
              [4, 5],
              [5, 5],
              [6, 5],
              [4, 6],
              [5, 6],
              [6, 6]])
and an array b
b = np.array([[4.001,4],
              [8.001,4],
              [5,4.0003],
              [5.9999,5]])
I want to find the indices of a that have values very close to those of b. If b contained exactly the same values as a, I could use the following code.
np.where((a==b[:,None]).all(-1))[1]
For clarity, I would like the code to return the following: [0, 1, 5]
These are the indices of the rows of a that are very close to rows in b. The row [8.001, 4] in b is discarded because it has no close match in a.
I think combining the code above with np.allclose() would fix it, but I can't figure out how to do this. Can you help me?

Instead of ==, use np.linalg.norm on a - b[:, np.newaxis] to get the distance of each row in a to each row in b.
If a and b have many rows, this will use lots of memory: e.g., if each has 10,000 rows, the vecdist array below would be 10,000-by-10,000, or 100,000,000 elements; using doubles, that is 800 MB.
In [50]: a = np.array([[4, 4],
    ...:               [5, 4],
    ...:               [6, 4],
    ...:               [4, 5],
    ...:               [5, 5],
    ...:               [6, 5],
    ...:               [4, 6],
    ...:               [5, 6],
    ...:               [6, 6]])
    ...:
In [51]: b = np.array([[4.001,4],
    ...:               [8.001,4],
    ...:               [5,4.0003],
    ...:               [5.9999,5]])
In [52]: vecdist = np.linalg.norm(a - b[:, np.newaxis], axis=-1)
In [53]: closeidx = np.flatnonzero(vecdist.min(axis=0) < 1e-2)
In [54]: print(closeidx)
[0 1 5]
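For completeness, the asker's np.isclose idea works as well. A minimal sketch with an explicit tolerance of 1e-2 (the same memory caveat applies, since the broadcast comparison is also len(b)-by-len(a)):

import numpy as np

a = np.array([[4, 4], [5, 4], [6, 4],
              [4, 5], [5, 5], [6, 5],
              [4, 6], [5, 6], [6, 6]])
b = np.array([[4.001, 4], [8.001, 4], [5, 4.0003], [5.9999, 5]])

# Compare every row of b against every row of a elementwise (shape (4, 9, 2)),
# require all components of a row pair to be close, then keep the rows of a
# that are close to at least one row of b.
close = np.isclose(a, b[:, None], atol=1e-2)
idx = np.flatnonzero(close.all(-1).any(0))
print(idx)  # [0 1 5]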

Related

Search a tensor for data/values

Given:
tensor([[6, 6],
        [4, 8],
        [7, 5],
        [7, 4],
        [6, 4]])
How do I find the index of rows with values [7,5]?
In general, how do I search for indices of any values: full and partial row or column?
Try with this:
>>> (a[:, None] == torch.tensor([7, 5])).all(-1).any(-1).nonzero().flatten().item()
2
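The one-liner above covers a full-row match. For the "partial row or column" part of the question, a minimal sketch of the same masking idea (not from the original answer):

import torch

a = torch.tensor([[6, 6],
                  [4, 8],
                  [7, 5],
                  [7, 4],
                  [6, 4]])

# Rows whose first column equals 7 (partial match on a single column):
print((a[:, 0] == 7).nonzero().flatten())  # tensor([2, 3])

# Rows containing the value 4 in any column:
print((a == 4).any(-1).nonzero().flatten())  # tensor([1, 3, 4])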

Add a matrix outside a loop

I have a function that returns a 17×3 matrix (float (17,3)). I call that function again and again in a loop, and I want to join the matrices so that the rows remain 17 but the columns keep accumulating into one big matrix.
Without NumPy:
Transpose the matrix first, because you are not going to touch the 17 rows.
# a matrix is 17 * 3
a_transpose = [[a[j][i] for j in range(len(a))] for i in range(len(a[0]))]
Then add each column of 17 values as one row of 17 columns:
a_transpose.append([1,2,3, ... 17])
Once you are done adding rows, transpose the matrix back as described above. That way, you don't iterate through your array 17 times every time you add a column to your matrix.
With NumPy:
Transpose:
# a matrix is 17 * 3
a = numpy.array(a)
a_transpose = a.transpose()
Add a row (the 17 column values you wanted to add). Note that NumPy arrays have no append method; numpy.append returns a new, larger array:
a_transpose = numpy.append(a_transpose, [[1,2,3, ... 17]], axis=0)
When you are done, transpose back with a_transpose.transpose().
Your function:
In [187]: def foo(i):
     ...:     return np.arange(i,i+6).reshape(3,2)
     ...:
Iteratively build a list of arrays:
In [188]: alist = []
In [189]: for i in range(4):
     ...:     alist.append(foo(i))
     ...:
In [190]: alist
Out[190]:
[array([[0, 1],
        [2, 3],
        [4, 5]]),
 array([[1, 2],
        [3, 4],
        [5, 6]]),
 array([[2, 3],
        [4, 5],
        [6, 7]]),
 array([[3, 4],
        [5, 6],
        [7, 8]])]
Make an array from that list:
In [191]: np.concatenate(alist, axis=1)
Out[191]:
array([[0, 1, 1, 2, 2, 3, 3, 4],
[2, 3, 3, 4, 4, 5, 5, 6],
[4, 5, 5, 6, 6, 7, 7, 8]])
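Applied to the question's 17×3 case, the same pattern looks like this (make_block is a hypothetical stand-in for your function):

import numpy as np

def make_block():
    # Stand-in for the function that returns a (17, 3) float array.
    return np.ones((17, 3))

blocks = [make_block() for _ in range(5)]   # call it in a loop, collect results
big = np.concatenate(blocks, axis=1)        # rows stay 17, columns accumulate
print(big.shape)                            # (17, 15)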

cumulative argmax of a numpy array

Consider the array a
np.random.seed([3,1415])
a = np.random.randint(0, 10, (10, 2))
a
array([[0, 2],
       [7, 3],
       [8, 7],
       [0, 6],
       [8, 6],
       [0, 2],
       [0, 4],
       [9, 7],
       [3, 2],
       [4, 3]])
What is a vectorized way to get the cumulative argmax?
array([[0, 0],  <-- both start off as max position
       [1, 1],  <-- 7 > 0 so 1st col = 1, 3 > 2 so 2nd col = 1
       [2, 2],  <-- 8 > 7 so 1st col = 2, 7 > 3 so 2nd col = 2
       [2, 2],  <-- 0 < 8 so 1st col stays the same, 6 < 7 so 2nd col stays the same
       [2, 2],
       [2, 2],
       [2, 2],
       [7, 2],  <-- 9 is the new max of the 1st col, so its argmax is now 7
       [7, 2],
       [7, 2]])
Here is a non-vectorized way to do it.
Notice that as the window expands, argmax applies to the growing window.
pd.DataFrame(a).expanding().apply(np.argmax).astype(int).values
array([[0, 0],
       [1, 1],
       [2, 2],
       [2, 2],
       [2, 2],
       [2, 2],
       [2, 2],
       [7, 2],
       [7, 2],
       [7, 2]])
Here's a vectorized pure NumPy solution that performs pretty snappily:
def cumargmax(a):
    # Running maximum of each column.
    m = np.maximum.accumulate(a)
    # Row indices, one column per column of a.
    x = np.repeat(np.arange(a.shape[0])[:, None], a.shape[1], axis=1)
    # Zero out the index wherever the running max did not increase...
    x[1:] *= m[:-1] < m[1:]
    # ...then carry forward the last index where a new max appeared.
    np.maximum.accumulate(x, axis=0, out=x)
    return x
Then we have:
>>> cumargmax(a)
array([[0, 0],
       [1, 1],
       [2, 2],
       [2, 2],
       [2, 2],
       [2, 2],
       [2, 2],
       [7, 2],
       [7, 2],
       [7, 2]])
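To see what the trick does, here is the first column of the example traced through the steps (a sketch; the values follow from the a defined above):

m = np.maximum.accumulate(a)
# m[:, 0] -> [0, 7, 8, 8, 8, 8, 8, 9, 9, 9]   (running max)
x = np.repeat(np.arange(a.shape[0])[:, None], a.shape[1], axis=1)
x[1:] *= m[:-1] < m[1:]
# x[:, 0] -> [0, 1, 2, 0, 0, 0, 0, 7, 0, 0]   (index kept only where a new max appears)
np.maximum.accumulate(x, axis=0, out=x)
# x[:, 0] -> [0, 1, 2, 2, 2, 2, 2, 7, 7, 7]   (last such index carried forward)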
Some quick testing on arrays with thousands to millions of values suggests that this is anywhere between 10 and 50 times faster than looping at the Python level (either implicitly or explicitly).
I can't think of a way to vectorize this over both columns easily; but if the number of columns is small relative to the number of rows, that shouldn't be an issue, and a for loop should suffice for that axis (a per-column sketch follows the 1-d code below):
import numpy as np
import numpy_indexed as npi
a = np.random.randint(0, 10, (10))
max = np.maximum.accumulate(a)
idx = npi.indices(a, max)
print(idx)
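A per-column sketch of looping that 1-d recipe over a 2-d array (this just reuses the answer's npi.indices call once per column; assumes numpy_indexed is installed):

import numpy as np
import numpy_indexed as npi

a2 = np.random.randint(0, 10, (10, 2))
out = np.column_stack([
    npi.indices(a2[:, j], np.maximum.accumulate(a2[:, j]))
    for j in range(a2.shape[1])
])
print(out)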
I would like to make a function that computes the cumulative argmax for a 1-d array and then apply it to all columns. This is the code:
import numpy as np
np.random.seed([3,1415])
a = np.random.randint(0, 10, (10, 2))
def cumargmax(v):
    uargmax = np.frompyfunc(lambda i, j: j if v[j] > v[i] else i, 2, 1)
    return uargmax.accumulate(np.arange(0, len(v)), 0, dtype=object).astype(v.dtype)
np.apply_along_axis(cumargmax, 0, a)
The reason for accumulating with dtype=object and then converting back is a workaround for NumPy 1.9, as mentioned in "generalized cumulative functions in NumPy/SciPy?". (The np.object alias used there was removed in recent NumPy releases; the plain built-in object works.)

Updating a NumPy array with another

Seemingly simple question: I have an array with two columns, where the first represents an ID and the second a count. I'd like to update it with another, similar array such that
import numpy as np
a = np.array([[1, 2],
              [2, 2],
              [3, 1],
              [4, 5]])
b = np.array([[2, 2],
              [3, 1],
              [4, 0],
              [5, 3]])
a.update(b)  # ????
>>> np.array([[1, 2],
              [2, 4],
              [3, 2],
              [4, 5],
              [5, 3]])
Is there a way to do this with indexing/slicing such that I don't simply have to iterate over each row?
Generic case
Approach #1: You can use np.add.at to do such an ID-based adding operation like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Find indices where the first columns of a,b are placed in out_id
_,a_idx = np.where(a[:,None,0]==out_id)
_,b_idx = np.where(b[:,None,0]==out_id)
# Place second column of a into out_id & add in second column of b
out_count[a_idx] = a[:,1]
np.add.at(out_count, b_idx,b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
To find a_idx and b_idx, np.searchsorted is probably a faster alternative and could be used like so -
a_idx = np.searchsorted(out_id, a[:,0], side='left')
b_idx = np.searchsorted(out_id, b[:,0], side='left')
Sample input-output:
In [538]: a
Out[538]:
array([[1, 2],
       [4, 2],
       [3, 1],
       [5, 5]])
In [539]: b
Out[539]:
array([[3, 7],
       [1, 1],
       [4, 0],
       [2, 3],
       [6, 2]])
In [540]: out
Out[540]:
array([[1, 3],
       [2, 3],
       [3, 8],
       [4, 2],
       [5, 5],
       [6, 2]])
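Put together for the question's original a and b, a runnable sketch of Approach #1 (using the searchsorted variant):

import numpy as np

a = np.array([[1, 2], [2, 2], [3, 1], [4, 5]])
b = np.array([[2, 2], [3, 1], [4, 0], [5, 3]])

out_id = np.union1d(a[:, 0], b[:, 0])       # [1 2 3 4 5]
out_count = np.zeros_like(out_id)
a_idx = np.searchsorted(out_id, a[:, 0])
b_idx = np.searchsorted(out_id, b[:, 0])
out_count[a_idx] = a[:, 1]                  # place a's counts
np.add.at(out_count, b_idx, b[:, 1])        # add in b's counts
out = np.column_stack((out_id, out_count))
print(out)  # [[1 2] [2 4] [3 2] [4 5] [5 3]]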
Approach #2: You can use np.bincount to do the same ID-based adding (np.bincount needs non-negative integer IDs and returns float sums when weights are passed) -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Get all IDs and counts in a single arrays
id_arr = np.concatenate((a[:,0],b[:,0]))
count_arr = np.concatenate((a[:,1],b[:,1]))
# Get binned summations
summed_vals = np.bincount(id_arr,count_arr)
# Get mask of valid bins
mask = np.in1d(np.arange(np.max(out_id)+1),out_id)
# Mask valid summed bins for final counts array output
out_count = summed_vals[mask]
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
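A runnable sketch of Approach #2 on the same a and b (note the cast at the end, since np.bincount's weighted sums come back as floats):

import numpy as np

a = np.array([[1, 2], [2, 2], [3, 1], [4, 5]])
b = np.array([[2, 2], [3, 1], [4, 0], [5, 3]])

out_id = np.union1d(a[:, 0], b[:, 0])
id_arr = np.concatenate((a[:, 0], b[:, 0]))
count_arr = np.concatenate((a[:, 1], b[:, 1]))
summed_vals = np.bincount(id_arr, count_arr)            # float sums per ID bin
mask = np.in1d(np.arange(np.max(out_id) + 1), out_id)   # keep only IDs that occur
out = np.column_stack((out_id, summed_vals[mask].astype(int)))
print(out)  # [[1 2] [2 4] [3 2] [4 5] [5 3]]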
Specific case
If the ID columns in a and b are sorted, it becomes easier, as we can just use masks with np.in1d to index into the output ID array created with np.union1d, like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Masks of first columns of a and b matches in the output ID array
mask1 = np.in1d(out_id,a[:,0])
mask2 = np.in1d(out_id,b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Place second column of a into out_id & add in second column of b
out_count[mask1] = a[:,1]
np.add.at(out_count, np.where(mask2)[0],b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
Sample run -
In [552]: a
Out[552]:
array([[1, 2],
       [2, 2],
       [3, 1],
       [4, 5],
       [8, 5]])
In [553]: b
Out[553]:
array([[2, 2],
       [3, 1],
       [4, 0],
       [5, 3],
       [6, 2],
       [8, 2]])
In [554]: out
Out[554]:
array([[1, 2],
       [2, 4],
       [3, 2],
       [4, 5],
       [5, 3],
       [6, 2],
       [8, 7]])
>>> col = np.unique(np.hstack((b[:,0], a[:,0])))
>>> dif = np.setdiff1d(col, a[:,0])
>>> val = b[np.in1d(b[:,0], dif)]
>>> result = np.concatenate((a, val))
>>> result
array([[1, 2],
       [2, 2],
       [3, 1],
       [4, 5],
       [5, 3]])
Note that if you want the result to be sorted by ID, you can use np.lexsort on the first column:
result[np.lexsort((result[:,0],))]
Explanation:
First, find the unique IDs with the following command:
>>> col = np.unique(np.hstack((b[:,0], a[:,0])))
>>> col
array([1, 2, 3, 4, 5])
Then find the difference between the IDs of a and all of the IDs:
>>> dif = np.setdiff1d(col, a[:,0])
>>> dif
array([5])
Then find the rows of b whose IDs are in dif:
>>> val = b[np.in1d(b[:,0], dif)]
>>> val
array([[5, 3]])
Finally, concatenate val with a:
>>> np.concatenate((a, val))
Consider another example, with sorting:
>>> a = np.array([[1, 2],
...               [2, 2],
...               [3, 1],
...               [7, 5]])
>>>
>>> b = np.array([[2, 2],
...               [3, 1],
...               [4, 0],
...               [5, 3]])
>>>
>>> col = np.unique(np.hstack((b[:,0], a[:,0])))
>>> dif = np.setdiff1d(col, a[:,0])
>>> val = b[np.in1d(b[:,0], dif)]
>>> result = np.concatenate((a, val))
>>> result[np.lexsort((result[:,0],))]
array([[1, 2],
       [2, 2],
       [3, 1],
       [4, 0],
       [5, 3],
       [7, 5]])
That's an old question, but here is a solution with pandas (which could be generalized to aggregation functions other than sum). Sorting also happens automatically:
import pandas as pd
import numpy as np
a = np.array([[1, 2],
              [2, 2],
              [3, 1],
              [4, 5]])
b = np.array([[2, 2],
              [3, 1],
              [4, 0],
              [5, 3]])
print((pd.DataFrame(a[:, 1], index=a[:, 0])
       .add(pd.DataFrame(b[:, 1], index=b[:, 0]), fill_value=0)
       .astype(int))
      .reset_index()
      .to_numpy())
Output:
[[1 2]
 [2 4]
 [3 2]
 [4 5]
 [5 3]]

subtracting a certain row in a matrix

So I have a 4-by-4 matrix: [[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7]]
I need to subtract [1,2,3,4] from the second row.
No NumPy if possible; I'm a beginner and don't know how to use it.
Thanks
With regular Python loops:
a = [[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7]]
b = [1,2,3,4]
for i in range(4):
    a[1][i] -= b[i]
Simply loop over the entries in the b list and subtract each from the corresponding entry in a[1], the second list (i.e., row) of the a matrix.
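An equivalent one-liner using zip, for the same effect:

a = [[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7]]
b = [1,2,3,4]
a[1] = [x - y for x, y in zip(a[1], b)]
print(a[1])  # [1, 1, 1, 1]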
However, NumPy can do this for you faster and more easily, and it isn't too hard to learn:
In [47]: import numpy as np
In [48]: a = np.array([[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7]])
In [49]: a
Out[49]:
array([[1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7]])
In [50]: a[1] -= [1,2,3,4]
In [51]: a
Out[51]:
array([[1, 2, 3, 4],
       [1, 1, 1, 1],
       [3, 4, 5, 6],
       [4, 5, 6, 7]])
Note that NumPy vectorizes many of its operations (such as subtraction), so the loops involved are handled for you (in fast, pre-compiled C-code).
