Slicing a matrix by an index matrix in TensorFlow - python

I have an n-by-m matrix X and an n-by-r index matrix I. I am wondering what are the relevant TensorFlow operators that allow me to get an n-by-r matrix R such that R[i,j] = X[i,I[i,j]]. As an example, let's say
X = tf.constant([[1,2,3],
[4,5,6],
[7,8,9]])
I = tf.constant([[1,2],
[1,0],
[0,2]])
The desired result would be a tensor
R = [[2, 3],
[5, 4],
[7, 9]]
I tried using each column of the matrix I as an index and computing tf.diag_part(tf.gather(X', index)), where X' is the transpose of X. This seems to give me one column of R when I has the same number of rows as X. For example,
idx = tf.transpose(I)[0] #[1,1,0]
res = tf.diag_part(tf.gather(tf.transpose(X), idx))
# res will be [2,5,7], i.e. the first column of R
Another attempt:
res = tf.transpose(tf.gather(tf.transpose(X), I),[0,2,1])
print(res.eval())
array([[[2, 3],
[5, 6],
[8, 9]],
[[2, 1],
[5, 4],
[8, 7]],
[[3, 1],
[6, 4],
[7, 9]]], dtype=int32)
From here I just need to select the "diagonal entries" res[0,0], res[1,1] and res[2,2] to get R, but this is where I get stuck.

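Use tf.gather with the batch_dims argument. With batch_dims=1 the first axis is treated as a batch axis, so row i of I indexes into row i of X: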
res = tf.gather(X, I, batch_dims=1)
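To check the semantics without TensorFlow, the same row-wise lookup can be sketched in NumPy. This is an analogue rather than the TensorFlow call itself: np.take_along_axis with axis=1 computes R[i, j] = X[i, I[i, j]], which is the behaviour the question asks for. The variable names mirror the question.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
I = np.array([[1, 2],
              [1, 0],
              [0, 2]])

# For every row i, pick the columns listed in I[i]: R[i, j] = X[i, I[i, j]]
R = np.take_along_axis(X, I, axis=1)
print(R)
```

The printed array should match the R given in the question.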

Related

Split numpy 2D array based on separate label array

I have a 2D numpy array A. For example:
A = np.array([[1, 2],
[3, 4],
[5, 6],
[7, 8],
[9, 0]])
I have another label array B corresponding to rows of A. For example:
B = np.array([0, 1, 2, 0, 1])
I want to split A into 3 arrays based on their labels, so the result would be:
[[[1, 2],
[7, 8]],
[[3, 4],
[9, 0]],
[[5, 6]]]
Are there any numpy built in functions to achieve this?
Right now, my solution is rather ugly: it involves repeatedly calling numpy.where in a for-loop and slicing the index tuples to keep only the rows.
Here's one way to do it:
hstack both arrays together.
Sort the array by the last column.
Split the array at the first index of each unique value.
a = np.hstack((A,B[:,None]))
a = a[a[:, -1].argsort()]
a = np.split(a[:,:-1], np.unique(a[:, -1], return_index=True)[1][1:])
OUTPUT:
[array([[1, 2],
[7, 8]]),
array([[3, 4],
[9, 0]]),
array([[5, 6]])]
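One detail the snippet above leaves implicit: the default sort used by argsort is not guaranteed to be stable, so rows that share a label may come out in a different order than they had in A. Below is a minimal sketch that requests a stable sort and skips the hstack step. The helper name split_by_label is made up for illustration.

```python
import numpy as np

def split_by_label(A, B):
    """Split the rows of A into one array per distinct label in B."""
    # A stable sort keeps rows with equal labels in their original order
    order = np.argsort(B, kind="stable")
    sorted_labels = B[order]
    # Positions where the sorted label changes mark the group boundaries
    boundaries = np.flatnonzero(np.diff(sorted_labels)) + 1
    return np.split(A[order], boundaries)

A = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 0]])
B = np.array([0, 1, 2, 0, 1])
groups = split_by_label(A, B)
print(groups)
```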
If the output can always be an array because the labels are equally distributed, you only need to sort the data by label:
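idx = B.argsort()
# Number of rows per label: position of the first label change in the sorted labels
n = np.flatnonzero(np.diff(B[idx]))[0] + 1
# Shape is (number of labels, rows per label, columns)
result = A[idx].reshape(A.shape[0] // n, n, A.shape[1])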
If the labels aren't equally distributed, you'll have to make a list in the outer dimension:
_, indices, counts = np.unique(B, return_counts=True, return_inverse=True)
result = np.split(A[indices.argsort()], counts.cumsum()[:-1])
Using the equivalent of np.where is not very efficient, but you can do it without a loop:
b, idx = np.unique(B, return_inverse=True)
mask = idx[:, None] == np.arange(b.size)
result = np.split(A[idx.argsort()], np.count_nonzero(mask, axis=0).cumsum()[:-1])
You can compute the mask simultaneously for all the labels, then count the number of matching elements in each category (np.count_nonzero(mask, axis=0).cumsum()) to find where to split the sorted A (A[idx.argsort()]). The last index is stripped off the cumulative sum because np.split always adds an implicit total index.
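To make the intermediate values concrete, here is the same idea spelled out step by step on the arrays from the question. This is a sketch, not a replacement for the code above. The variable names are illustrative, and kind="stable" is an addition to keep rows with equal labels in their original order.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 0]])
B = np.array([0, 1, 2, 0, 1])

labels, inverse = np.unique(B, return_inverse=True)

# mask[r, k] is True when row r carries the k-th distinct label
mask = inverse[:, None] == np.arange(labels.size)

# Column sums of the mask give the number of rows per label
counts = np.count_nonzero(mask, axis=0)

# Cumulative counts, minus the final total, are the split positions
split_points = counts.cumsum()[:-1]

order = inverse.argsort(kind="stable")
result = np.split(A[order], split_points)
print(counts, split_points)
```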
You could also use Pandas for this because it's designed for labelled data and has a powerful groupby method.
import pandas as pd
index = pd.Index(B, name='label')
df = pd.DataFrame(A, index=index)
groups = {k: v.values for k, v in df.groupby('label')}
print(groups)
This produces a dictionary of arrays of the grouped values:
{0: array([[1, 2],
[7, 8]]), 1: array([[3, 4],
[9, 0]]), 2: array([[5, 6]])}
For a list of the arrays you can do this instead:
groups = [v.values for k, v in df.groupby('label')]
This is probably the simplest way:
groups = [A[B == label, :] for label in np.unique(B)]
print(groups)
Output:
[array([[1, 2],
[7, 8]]), array([[3, 4],
[9, 0]]), array([[5, 6]])]

Extracting a subtensor from a tensor according to an index tensor

I have this tensor:
tensor([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
and I have this index tensor:
tensor([0, 1])
and what I want to get is the subtensors according to dim 1 and the corresponding indices in the index tensor, that is:
tensor([[1, 2],
[7, 8]])
I tried to use the torch.gather() function and advanced indexing with no success. Can anyone help?
Each value in your index tensor is implicitly paired with its own position: position i selects along dim 0, and the value ix[i] selects along dim 1. In your example the positions just happen to be the same as the values. To walk through the first-level elements of the tensor explicitly, use torch.arange to construct the first-level indices.
import torch
from torch import tensor
t = tensor([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
ix = tensor([0, 1])
ix0 = torch.arange(0, ix.shape.numel())
t[ix0, ix]
# returns:
tensor([[1, 2],
[7, 8]])
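The difference between pairing two index arrays and slicing with one can be seen in a small NumPy sketch. NumPy is used here only as a stand-in, on the assumption that its advanced-indexing rules match PyTorch's for this case. The names rows, picked and sliced are illustrative.

```python
import numpy as np

t = np.array([[[1, 2], [3, 4]],
              [[5, 6], [7, 8]]])
ix = np.array([0, 1])

rows = np.arange(len(ix))

# Paired index arrays: picked[k] = t[rows[k], ix[k]]
picked = t[rows, ix]

# A slice plus an index array: every block is kept, its rows reordered by ix
sliced = t[:, ix]

print(picked)
print(sliced.shape)
```

Only the paired form reduces the result to one row per block, which is what the question asks for.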

Scatter tensor in pytorch along the rows

I want to scatter tensors in granularities of rows.
For example consider,
Input = torch.tensor([[2, 3], [3, 4], [4, 5]])
I want to scatter
S = torch.tensor([[1,2],[1,2]])
to indices
I = torch.tensor([0,2])
I expect the output to be torch.tensor([[1, 2], [3, 4], [1, 2]]).
Here S[0] is scattered to Input[I[0]], similarly S[1] is scattered to Input[I[1]]
How can I achieve this? Instead of looping over the row in S, I am looking for a more efficient way.
Use indexed assignment, input[I] = S, which modifies the tensor in place.
Example:
input = torch.tensor([[2, 3], [3, 4], [4, 5]])
S = torch.tensor([[1,2],[1,2]])
I = torch.tensor([0,2])
input[I] = S
input
tensor([[1, 2],
[3, 4],
[1, 2]])
Answer might be a little late, but anyway, you could do:
import torch
inp = torch.tensor([[2,3], [3,4], [4,5]])
src = torch.tensor([[1,2], [1,2]])
idxs = torch.tensor([[0,0],[2,2]])
y = torch.scatter(inp, 0, idxs, src)
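The index tensor [[0,0],[2,2]] in the snippet above is just the row indices [0, 2] repeated across the columns. The sketch below builds it programmatically and applies it with np.put_along_axis, used here as a NumPy analogue of scattering along dim 0. It is not the PyTorch API, and the variable names follow the answer above.

```python
import numpy as np

inp = np.array([[2, 3], [3, 4], [4, 5]])
src = np.array([[1, 2], [1, 2]])
I = np.array([0, 2])

# Repeat each target row index once per column: [[0, 0], [2, 2]]
idxs = np.repeat(I[:, None], src.shape[1], axis=1)

# Work on a copy so the original array is left untouched
out = inp.copy()
# Writes out[idxs[i, j], j] = src[i, j]
np.put_along_axis(out, idxs, src, axis=0)
print(out)
```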

Reorganizing a 3d numpy array

I've tried and searched for a few days, I've come closer but need your help.
I have a 3d array in python,
shape(files)
>> (31,2049,2)
which corresponds to 31 input files with 2 columns of data with 2048 rows and a header.
I'd like to sort this array based on the header, which is a number, in each file.
I tried to follow "NumPy: sorting 3D array but keeping 2nd dimension assigned to first", but I'm incredibly confused.
First I tried to get my headers for the argsort. I thought I could do
sortval=files[:][0][0]
but this does not work..
Then I simply did a for loop to iterate and get my headers
sortval = []
for i in range(shape(files)[0]):
    sortval.append(files[i][0][0])
Then
sortedIdx = np.argsort(sortval)
This works, however I don't understand what's happening in the last line:
files = files[np.arange(len(deck))[:,np.newaxis],sortedIdx]
Help would be appreciated.
Another way to do this is with np.take
header = a[:,0,0]
sorted = np.take(a, np.argsort(header), axis=0)
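A minimal runnable sketch of this approach on a small made-up array follows. It also checks that np.take along axis 0 gives the same result as plain indexing with the sorted order. The result is named by_take rather than sorted to avoid shadowing the Python builtin.

```python
import numpy as np

a = np.array([[[3, 1], [3, 7], [0, 3]],
              [[2, 9], [1, 0], [9, 2]],
              [[9, 2], [8, 8], [8, 0]]])

# The header of each 2D block is its top-left entry
header = a[:, 0, 0]
order = np.argsort(header)

# Reorder the 2D blocks along the first axis
by_take = np.take(a, order, axis=0)
by_index = a[order]

print(by_take[:, 0, 0])
```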
Here we can use a simple example to demonstrate what your code is doing:
First we create a random 3D numpy matrix:
a = (np.random.rand(3,3,2)*10).astype(int)
array([[[3, 1],
[3, 7],
[0, 3]],
[[2, 9],
[1, 0],
[9, 2]],
[[9, 2],
[8, 8],
[8, 0]]])
Then a[:] just gives a itself, and a[:][0][0] is the first row of the first 2D array in a:
a[:][0]
# array([[3, 1],
# [3, 7],
# [0, 3]])
a[:][0][0]
# array([3, 1])
What you want are the headers, which are 3, 2 and 9 in this example, so we can use a[:, 0, 0] to extract them:
a[:,0,0]
# array([3, 2, 9])
Now we sort the above list and get an index array:
np.argsort(a[:,0,0])
# array([1, 0, 2])
To rearrange the entire 3D array, we need to index it in the correct order. np.arange(len(a))[:,np.newaxis] is equivalent to np.arange(len(a)).reshape(-1,1), which creates a sequential 2D index array:
np.arange(len(a))[:,np.newaxis]
# array([[0],
# [1],
# [2]])
Without the 2D index array, the two 1D index arrays are paired element by element and the result has only 2 dimensions:
a[np.arange(3), np.argsort(a[:,0,0])]
# array([[3, 7],
# [2, 9],
# [8, 0]])
With the 2D index array, the two index arrays broadcast against each other, so the result keeps its 3D shape:
a[np.arange(3).reshape(-1,1), np.argsort(a[:,0,0])]
array([[[3, 7],
[3, 1],
[0, 3]],
[[1, 0],
[2, 9],
[9, 2]],
[[8, 8],
[9, 2],
[8, 0]]])
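This is what the last line of your code computes. Note that it reorders the rows within each 2D array rather than reordering the 2D arrays themselves.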
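The broadcasting rule behind the 3D case can be verified directly: with a column of row indices and a 1D order array, entry [i, j] of the result is a[i, order[j]]. The sketch below uses the example array from this answer. The names rows, out and matches are illustrative.

```python
import numpy as np

a = np.array([[[3, 1], [3, 7], [0, 3]],
              [[2, 9], [1, 0], [9, 2]],
              [[9, 2], [8, 8], [8, 0]]])

order = np.argsort(a[:, 0, 0])
# Column vector of block indices, shape (3, 1)
rows = np.arange(len(a))[:, np.newaxis]

# rows (3, 1) and order (3,) broadcast to a (3, 3) grid of index pairs
out = a[rows, order]

# Check the rule out[i, j] == a[i, order[j]] for every pair
matches = all(
    np.array_equal(out[i, j], a[i, order[j]])
    for i in range(a.shape[0])
    for j in range(len(order))
)
print(out.shape, matches)
```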
Edit:
To rearrange the 2D arrays themselves, that is, to sort along the first axis, one could use:
a[np.argsort(a[:,0,0])]
array([[[2, 9],
[1, 0],
[9, 2]],
[[3, 1],
[3, 7],
[0, 3]],
[[9, 2],
[8, 8],
[8, 0]]])

Filtering multiple NumPy arrays based on the intersection of one column

I have three rather large NumPy arrays with varying numbers of rows, whose first columns are all integers. My hope is to filter these arrays such that the only rows left are those for whom the value in the first column is shared by all three. This would leave three arrays of the same size. The entries in the other columns are not necessarily shared across arrays.
So, with input:
A =
[[1, 1],
[2, 2],
[3, 3],]
B =
[[2, 1],
[3, 2],
[4, 3],
[5, 4]]
C =
[[2, 2],
[3, 1],
[5, 2]]
I hope to get back as output:
A =
[[2, 2],
[3, 3]]
B =
[[2, 1],
[3, 2]]
C =
[[2, 2],
[3, 1]]
My current approach is to:
Find the intersection of the three first columns using numpy.intersect1d()
Use numpy.in1d() on this intersection and the first columns of each array to find the row indices that are not shared in each array (converting boolean to index using a modified version of the method found here: Python: intersection indices numpy array )
Finally using numpy.delete() with each of these indices and its respective array to remove rows with non-shared entries in the first column.
I'm wondering whether there is a faster or more elegantly Pythonic way to go about this, ideally one suited to very large arrays.
Your indices in your example are sorted and unique. Assuming this is no coincidence (and this situation often arises, or can easily be enforced), the following works:
import numpy as np
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])
B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])
C = np.array(
[[2, 2],
[3, 1],
[5, 2],])
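from functools import reduce  # reduce is not a builtin in Python 3
# Intersect the first columns of all arrays; assume_unique=True skips deduplication
I = reduce(
    lambda l,r: np.intersect1d(l,r,assume_unique=True),
    (i[:,0] for i in (A,B,C)))
# The first columns are sorted, so searchsorted finds the row of each shared key
print(A[np.searchsorted(A[:,0], I)])
print(B[np.searchsorted(B[:,0], I)])
print(C[np.searchsorted(C[:,0], I)])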
and in case the first column is not in sorted order (but is still unique):
C = np.array(
[[9, 2],
[1,6],
[5, 1],
[2, 5],
[3, 2],])
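def index_by_first_column_entry(M, keys):
    colkeys = M[:,0]
    # Indices that would sort the first column
    sorter = np.argsort(colkeys)
    # Position of each key within the sorted order of the column
    index = np.searchsorted(colkeys, keys, sorter = sorter)
    return M[sorter[index]]
print(index_by_first_column_entry(C, I))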
and make sure to change True to False (the assume_unique argument) in
I = reduce(
lambda l,r: np.intersect1d(l,r,False),
(i[:,0] for i in (A,B,C)))
A generalization to duplicate values can be made using np.unique.
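For first columns that may be unsorted or contain duplicates, a boolean mask is one alternative to searchsorted. The following is a minimal sketch that swaps in np.isin for the lookup step. It is not the method described above, and the helper name filter_by_shared_keys is made up for illustration.

```python
from functools import reduce

import numpy as np

def filter_by_shared_keys(*arrays):
    """Keep only the rows whose first-column value occurs in every array."""
    # intersect1d returns the sorted unique values common to both inputs
    shared = reduce(np.intersect1d, (arr[:, 0] for arr in arrays))
    # A boolean mask keeps the original row order and any duplicate keys
    return [arr[np.isin(arr[:, 0], shared)] for arr in arrays]

A = np.array([[1, 1], [2, 2], [3, 3]])
B = np.array([[2, 1], [3, 2], [4, 3], [5, 4]])
C = np.array([[2, 2], [3, 1], [5, 2]])

subA, subB, subC = filter_by_shared_keys(A, B, C)
print(subA, subB, subC, sep="\n")
```

If a shared key is repeated a different number of times in different arrays, the filtered arrays will no longer have the same number of rows.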
One way to do this is to build an indicator array, or a hash table if you like, to indicate which integers are in all your input arrays. Then you can use boolean indexing based on this indicator array to get the subarrays. Something like this:
import numpy as np
# Setup
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])
B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])
C = np.array(
[[2, 2],
[3, 1],
[5, 2],])
def take_overlap(*input):
    n = len(input)
    maxIndex = max(array[:, 0].max() for array in input)
    indicator = np.zeros(maxIndex + 1, dtype=int)
    for array in input:
        indicator[array[:, 0]] += 1
    indicator = indicator == n
    result = []
    for array in input:
        # Look up each integer in the indicator array
        mask = indicator[array[:, 0]]
        # Use boolean indexing to get the sub array
        result.append(array[mask])
    return result
subA, subB, subC = take_overlap(A, B, C)
This should be quite fast, and it does not assume that the elements of the input arrays are unique or sorted. However, it could take a lot of memory, and might be a bit slower, if the indexing integers are sparse, e.g. [1, 10, 10000]. It should be close to optimal if the integers are more or less dense.
This works but I'm not sure if it is faster than any of the other answers:
import numpy as np
A = np.array(
[[1, 1],
[2, 2],
[3, 3],])
B = np.array(
[[2, 1],
[3, 2],
[4, 3],
[5, 4]])
C = np.array(
[[2, 2],
[3, 1],
[5, 2],])
a = A[:,0]
b = B[:,0]
c = C[:,0]
ab = np.where(a[:, np.newaxis] == b[np.newaxis, :])
bc = np.where(b[:, np.newaxis] == c[np.newaxis, :])
ab_in_bc = np.in1d(ab[1], bc[0])
bc_in_ab = np.in1d(bc[0], ab[1])
arows = ab[0][ab_in_bc]
brows = ab[1][ab_in_bc]
crows = bc[1][bc_in_ab]
anew = A[arows, :]
bnew = B[brows, :]
cnew = C[crows, :]
print(anew)
print(bnew)
print(cnew)
gives:
[[2 2]
[3 3]]
[[2 1]
[3 2]]
[[2 2]
[3 1]]
