Efficient boolean masking with Tensorflow SparseTensors - python

So, I want to mask out entire rows of a SparseTensor. This would be easy to do with tf.boolean_mask, but there isn't an equivalent for SparseTensors. Currently, something that is possible is for me to just go through all of the indices in SparseTensor.indices and filter out all of the ones that aren't a masked row, e.g.:
masked_indices = list(filter(lambda index: masked_rows[index[0]], indices))
where masked_rows is a 1D array of whether or not the row at that index is masked.
However, this is really slow, since my SparseTensor is fairly large (it has 90k indices, but will be growing to be significantly larger). It takes quite a few seconds on a single data point, before I even apply SparseTensor.mask on the filtered indices. Another flaw of the approach is that it doesn't actually remove the rows, either (although, in my case, a row of all zeros is just as fine).
Is there a better way to mask a SparseTensor by row, or is this the best approach?

You can do that like this:
import tensorflow as tf

def boolean_mask_sparse_1d(sparse_tensor, mask, axis=0):  # mask is assumed to be 1D
    mask = tf.convert_to_tensor(mask)
    # For every stored entry, look up whether its slice along `axis` is kept
    ind = sparse_tensor.indices[:, axis]
    mask_sp = tf.gather(mask, ind)
    # The masked axis shrinks to the number of True entries in the mask
    new_size = tf.math.count_nonzero(mask)
    new_shape = tf.concat([sparse_tensor.shape[:axis], [new_size],
                           sparse_tensor.shape[axis + 1:]], axis=0)
    new_shape = tf.dtypes.cast(new_shape, tf.int64)
    # Exclusive cumsum maps each kept old index to its new, compacted index
    mask_count = tf.cumsum(tf.dtypes.cast(mask, tf.int64), exclusive=True)
    masked_idx = tf.boolean_mask(sparse_tensor.indices, mask_sp)
    new_idx_axis = tf.gather(mask_count, masked_idx[:, axis])
    new_idx = tf.concat([masked_idx[:, :axis],
                         tf.expand_dims(new_idx_axis, 1),
                         masked_idx[:, axis + 1:]], axis=1)
    new_values = tf.boolean_mask(sparse_tensor.values, mask_sp)
    return tf.SparseTensor(new_idx, new_values, new_shape)

# Test
sp = tf.SparseTensor([[1], [3], [4], [6]], [1, 2, 3, 4], [7])
mask = tf.constant([True, False, True, True, False, False, True])
out = boolean_mask_sparse_1d(sp, mask)
print(out.indices.numpy())
# [[2]
#  [3]]
print(out.values.numpy())
# [2 4]
print(out.shape)
# (4,)
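To see why the exclusive cumulative sum gives the right renumbering, here is a small NumPy sketch of the same steps on plain index/value arrays. It is only an illustration, assuming a 1D mask applied along one axis; the helper name mask_coo_rows is made up for this example and is not part of TensorFlow or NumPy.

```python
import numpy as np

def mask_coo_rows(indices, values, dense_shape, mask, axis=0):
    """Hypothetical helper: drop masked-out slices from COO-style data."""
    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values)
    mask = np.asarray(mask, dtype=bool)
    # Which stored entries live in a slice that is kept?
    keep = mask[indices[:, axis]]
    # Exclusive cumulative sum: the new position of every kept slice
    m = mask.astype(np.int64)
    new_pos = np.cumsum(m) - m
    new_indices = indices[keep].copy()
    new_indices[:, axis] = new_pos[new_indices[:, axis]]
    # The masked axis shrinks to the number of True entries
    new_shape = list(dense_shape)
    new_shape[axis] = int(m.sum())
    return new_indices, values[keep], tuple(new_shape)

idx, vals, shape = mask_coo_rows(
    [[1], [3], [4], [6]], [1, 2, 3, 4], [7],
    [True, False, True, True, False, False, True])
print(idx.ravel(), vals, shape)
```

Everything here is a vectorized lookup, so nothing loops over the stored entries in Python, which is the part that was slow in the question.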

Related

How to elegantly drop unnecessary elements in numpy?

I have an ndarray of shape [batch_size, seq_len, num_features]. However, some of the elements at the end of the sequence dimension are not needed, so I want to drop them and merge the sequence dimension into the batch dimension. For example, the ndarray a I want to manipulate is
batch_size = 2
seq_len = 3
num_features = 1
a = np.random.randn(batch_size, seq_len, num_features)
mask = np.ones((batch_size, seq_len), dtype=bool)  # np.bool was removed in NumPy 1.24; the builtin bool works everywhere
mask[0][1:] = 0
mask[1][2:] = 0
"""
>>> a = [[[-0.3908401 ]
          [ 0.89686512]
          [ 0.07594243]]
         [[-0.12256737]
          [-1.00838131]
          [ 0.56543754]]]
mask = [[ True False False]
        [ True  True False]]
"""
where mask indicates which elements of a are useful. I can get what I want using the following code:
res = []
for seq, m in zip(a, mask):
    res.append(seq[:sum(m)])
np.concatenate(res, axis=0)
"""
>>>array([[0.08676509],
[0.47162315],
[0.98070665]])
"""
I'm wondering if there is a more elegant way to do this in numpy?
Not sure if this is what you're asking, but the results look fine:
res = a[mask]
Since the batch and sequence dimensions are going to be merged, you could reshape a to a 2D array of shape (batch_size * seq_len, num_features) and flatten mask to a 1D array of length batch_size * seq_len.
Next, simply filter the important samples using boolean indexing. See the code:
mask2d = mask.reshape(-1) # or mask.ravel()
a2d = a.reshape(-1, num_features)
result = a2d[mask2d]
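As a quick sanity check, the sketch below compares the loop from the question with both answers on random data. It assumes, as in the question, that the True entries of each mask row form a prefix of the sequence; boolean indexing selects in row-major order, so with a mask that has gaps the slicing loop and the boolean index would no longer agree. The seed and variable names are just for illustration.

```python
import numpy as np

batch_size, seq_len, num_features = 2, 3, 1
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
a = rng.standard_normal((batch_size, seq_len, num_features))
mask = np.ones((batch_size, seq_len), dtype=bool)
mask[0, 1:] = False
mask[1, 2:] = False

# Loop from the question: keep the first sum(m) steps of every sequence
looped = np.concatenate([seq[:m.sum()] for seq, m in zip(a, mask)], axis=0)
# Boolean indexing on the two leading axes
direct = a[mask]
# Explicit reshape followed by a 1D boolean index
reshaped = a.reshape(-1, num_features)[mask.ravel()]

print(direct.shape)
print(np.array_equal(looped, direct), np.array_equal(direct, reshaped))
```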

Get masked argmax with different mask for each row in TensorFlow

I have a tensor of shape Nx7, which looks something like this:
[[0.97863993 0.64479575 -0.202357 0.94678476 0.0080051 0.44507797 0.47864]
 [0.05914348 -0.72649432 0.193803 0.47295245 0.8381458 0.30449861 0.46783]]
I have another tensor of the same shape, which is a boolean mask:
[[True False True True False True False]
 [False True False False True False False]]
I want to get the argmax of each row in the first tensor, but only of those elements for which the mask is True, so basically the argmax of the following array:
[[0.97863993 X -0.202357 0.94678476 X 0.44507797 X]
 [X -0.72649432 X X 0.8381458 X X]]
Which should thus become:
[0
4]
Is this possible in TensorFlow? I am trying to figure it out with tf.boolean_mask, but I don't see how to deal with different rows having differing numbers of True values in the mask.
Input code in TF:
mask = tf.placeholder(shape=[None, 7], dtype=tf.bool)
val = tf.placeholder(shape=[None, 7], dtype=tf.float32)
arg_max = ???
Note that I want negative values to be handled correctly as well (otherwise the method proposed by Ishant Mrinal would work).
Convert the boolean array into a float array
# mask = tf.placeholder(shape=[None, 7], dtype=tf.bool)
# mask = tf.cast(mask, dtype=tf.float32)
mask = tf.placeholder(shape=[None, 7], dtype=tf.float32)
val = tf.placeholder(shape=[None, 7], dtype=tf.float32)
argmax = tf.argmax(tf.multiply(val, mask), axis=1)
sess.run(argmax, {val: your_val_array, mask: 2*mask_bool_array.astype(float)-1 })
To emulate a masked argmax, you can set values outside of the mask to -inf, for example like this:
masked_val = tf.minimum(val, (2 * tf.cast(mask, tf.float32) - 1) * np.inf)  # tf.cast replaces the deprecated tf.to_float
masked_arg_max = tf.argmax(masked_val, axis=1)
Alternatively, to compute masked_val, you could use
masked_val = tf.where(mask, val, -tf.ones_like(val) * np.inf)
which is arguably clearer, but may waste memory.
For a masked argmin, you would do the opposite:
masked_val = tf.maximum(val, (1 - 2 * tf.cast(mask, tf.float32)) * np.inf)  # tf.cast replaces the deprecated tf.to_float
masked_arg_min = tf.argmin(masked_val, axis=1)
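The same trick can be checked outside of a TensorFlow session. The NumPy sketch below uses the arrays from the question and np.where in place of tf.where; it is only meant to illustrate the idea, and it also shows why simply multiplying by a 0/1 mask goes wrong when the kept values are negative. The small neg and neg_mask arrays are made up for that purpose.

```python
import numpy as np

val = np.array([
    [0.97863993, 0.64479575, -0.202357, 0.94678476, 0.0080051, 0.44507797, 0.47864],
    [0.05914348, -0.72649432, 0.193803, 0.47295245, 0.8381458, 0.30449861, 0.46783],
])
mask = np.array([
    [True, False, True, True, False, True, False],
    [False, True, False, False, True, False, False],
])

# Masked-out entries become -inf (argmax) or +inf (argmin), so they can never win.
# Note: a row whose mask is all False would still return index 0.
masked_arg_max = np.where(mask, val, -np.inf).argmax(axis=1)
masked_arg_min = np.where(mask, val, np.inf).argmin(axis=1)
print(masked_arg_max)  # [0 4]
print(masked_arg_min)  # [2 1]

# Made-up example: multiplying by a 0/1 mask fails for negative values,
# because the zeroed-out entry is larger than every kept entry.
neg = np.array([[-1.0, 5.0, -2.0]])
neg_mask = np.array([[True, False, True]])
print((neg * neg_mask).argmax(axis=1))                 # [1], a masked-out position
print(np.where(neg_mask, neg, -np.inf).argmax(axis=1))  # [0]
```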

Tensorflow - pick values from indices, what is the operation called?

An example
Suppose I have a tensor values with shape (2,2,2)
values = [[[0, 1],[2, 3]],[[4, 5],[6, 7]]]
And a tensor indicies with shape (2,2) which describes which value to select along the innermost dimension
indicies = [[1,0],[0,0]]
Then the result will be a (2,2) matrix with these values
result = [[1,2],[4,6]]
What is this operation called in tensorflow and how to do it?
General
Note that the above shape (2,2,2) is only an example, it can be any dimension. Some conditions for this operation:
ndim(values) - 1 == ndim(indicies)
values.shape[:-1] == indicies.shape == result.shape
indicies.max() < values.shape[-1]
I think you can emulate this with tf.gather_nd. You will just have to convert "your" indices to a representation that is suitable for tf.gather_nd. The following example here is tied to your specific example, i.e. input tensors of shape (2, 2, 2) but I think this gives you an idea how you could write the conversion for input tensors with arbitrary shape, although I am not sure how easy it would be to implement this (haven't thought about it too long). Also, I'm not claiming that this is the easiest possible solution.
import tensorflow as tf
import numpy as np

values = np.array([[[0, 1], [2, 3]], [[4, 5], [6, 7]]])
values_tf = tf.constant(values)
indices = np.array([[1, 0], [0, 0]])

# Prefix every index with its own position (k, l) so gather_nd gets full coordinates
converted_idx = []
for k in range(values.shape[0]):
    outer = []
    for l in range(values.shape[1]):
        inds = [k, l, indices[k][l]]
        outer.append(inds)
    converted_idx.append(outer)

with tf.Session() as sess:
    result = tf.gather_nd(values_tf, converted_idx)
    print(sess.run(result))
This prints
[[1 2]
[4 6]]
Edit: To handle arbitrary shapes here is a recursive solution that should work (only tested on your example):
def convert_idx(last_dim_vals, ori_indices, access_to_ori, depth):
    if depth == len(last_dim_vals.shape) - 1:
        inds = access_to_ori + [ori_indices[tuple(access_to_ori)]]
        return inds
    outer = []
    for k in range(ori_indices.shape[depth]):
        inds = convert_idx(last_dim_vals, ori_indices, access_to_ori + [k], depth + 1)
        outer.append(inds)
    return outer
You can use this together with the original code I posted like so:
...
converted_idx = convert_idx(values, indices, [], 0)
with tf.Session() as sess:
    result = tf.gather_nd(values_tf, converted_idx)
    print(sess.run(result))
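For reference, the index conversion can also be built without Python loops or recursion. The NumPy sketch below is one way to do it, not necessarily the simplest: np.take_along_axis does the pick directly, and np.indices builds the same full coordinates that the loops above produce, which could then be passed to tf.gather_nd. Recent TensorFlow versions also accept a batch_dims argument to tf.gather that may express this directly; I have not verified that here.

```python
import numpy as np

values = np.array([[[0, 1], [2, 3]], [[4, 5], [6, 7]]])
indices = np.array([[1, 0], [0, 0]])

# Direct pick along the last axis
picked = np.take_along_axis(values, indices[..., None], axis=-1)[..., 0]
print(picked)

# Full coordinates: one coordinate array per leading axis, plus the chosen index
grid = np.indices(indices.shape)                 # shape (ndim, *indices.shape)
full_idx = np.stack([*grid, indices], axis=-1)   # shape (*indices.shape, ndim + 1)
print(full_idx.tolist())

# Indexing with the coordinate tuple gives the same result as the direct pick
same = values[tuple(np.moveaxis(full_idx, -1, 0))]
print(same)
```

Because np.indices works for any shape, this should carry over to inputs with more leading dimensions as long as values.shape[:-1] == indices.shape.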

Vectorizing an operation between all pairs of elements in two numpy arrays

Given two arrays where each row represents a circle (x, y, r):
data = {}
data[1] = np.array([[455.108, 97.0478, 0.0122453333],
[403.775, 170.558, 0.0138770952],
[255.383, 363.815, 0.0179857619]])
data[2] = np.array([[455.103, 97.0473, 0.012041],
[210.19, 326.958, 0.0156912857],
[455.106, 97.049, 0.0150472381]])
I would like to pull out all of the pairs of circles that are not disjoint. This can be done by:
close_data = {}
for row1 in data[1]:  # loop over first array
    for row2 in data[2]:  # loop over second array
        condition = ((abs(row1[0]-row2[0]) + abs(row1[1]-row2[1])) < (row1[2]+row2[2]))
        if condition:  # circles overlap if true
            if tuple(row1) not in close_data.keys():
                close_data[tuple(row1)] = [row1, row2]  # pull out close data points
            else:
                close_data[tuple(row1)].append(row2)
for k, v in close_data.items():
    print(k, v)
#desired outcome
#(455.108, 97.047799999999995, 0.012245333299999999)
#[array([ 4.55108000e+02, 9.70478000e+01, 1.22453333e-02]),
# array([ 4.55103000e+02, 9.70473000e+01, 1.2040000e-02]),
# array([ 4.55106000e+02, 9.70490000e+01, 1.50472381e-02])]
However the multiple loops over the arrays are very inefficient for large datasets. Is it possible to vectorize the calculations so I get the advantage of using numpy?
The most difficult bit is actually getting to your representation of the info. Note that I squared the distances and the radius sums, so this compares Euclidean distances; if you really want the sum of absolute differences from your code instead, you have to change that back.
import numpy as np
data = {}
data[1] = np.array([[455.108, 97.0478, 0.0122453333],
[403.775, 170.558, 0.0138770952],
[255.383, 363.815, 0.0179857619]])
data[2] = np.array([[455.103, 97.0473, 0.012041],
[210.19, 326.958, 0.0156912857],
[455.106, 97.049, 0.0150472381]])
d1 = data[1][:, None, :]
d2 = data[2][None, :, :]
dists2 = ((d1[..., :2] - d2[..., :2])**2).sum(axis = -1)
radss2 = (d1[..., 2] + d2[..., 2])**2
inds1, inds2 = np.where(dists2 <= radss2)
# translate to your representation:
bnds = np.r_[np.searchsorted(inds1, np.arange(3)), len(inds1)]
rows = [data[2][inds2[bnds[i]:bnds[i+1]]] for i in range(3)]
out = dict([(tuple (data[1][i]), rows[i]) for i in range(3) if rows[i].size > 0])
Here is a pure numpythonic way (a is data[1] and b is data[2]):
In [80]: p = np.arange(3) # for creating the indices of combinations using np.tile and np.repeat
In [81]: a = a[np.repeat(p, 3)] # creates the first column of combination array
In [82]: b = b[np.tile(p, 3)] # creates the second column of combination array
In [83]: abs(a[:, :2] - b[:, :2]).sum(1) < a[:, 2] + b[:, 2]
Out[83]: array([ True, False, True, False, False, False, False, False, False], dtype=bool)
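For comparison, here is a broadcasting sketch that keeps the overlap condition exactly as written in the question (sum of absolute coordinate differences against the sum of radii) and rebuilds the dictionary from the matching index pairs. The names circles_a and circles_b are just stand-ins for data[1] and data[2].

```python
import numpy as np

circles_a = np.array([[455.108, 97.0478, 0.0122453333],
                      [403.775, 170.558, 0.0138770952],
                      [255.383, 363.815, 0.0179857619]])
circles_b = np.array([[455.103, 97.0473, 0.012041],
                      [210.19, 326.958, 0.0156912857],
                      [455.106, 97.049, 0.0150472381]])

# Broadcast (n, 1, 2) against (1, m, 2) to compare every pair at once
dist = np.abs(circles_a[:, None, :2] - circles_b[None, :, :2]).sum(axis=-1)
radii = circles_a[:, None, 2] + circles_b[None, :, 2]
overlap = dist < radii                      # boolean (n, m) matrix

# Only the matching pairs are visited in Python
rows_a, rows_b = np.nonzero(overlap)
close_data = {}
for i, j in zip(rows_a, rows_b):
    key = tuple(circles_a[i])
    close_data.setdefault(key, [circles_a[i]]).append(circles_b[j])

print(list(zip(rows_a.tolist(), rows_b.tolist())))  # [(0, 0), (0, 2)]
```

The intermediate arrays have n * m entries, so for very large inputs this trades memory for speed; processing circles_a in chunks would be one way around that.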

TensorFlow getting elements of every row for specific columns

If A is a TensorFlow variable like so
A = tf.Variable([[1, 2], [3, 4]])
and index is another variable
index = tf.Variable([0, 1])
I want to use this index to select columns in each row. In this case, item 0 from first row and item 1 from second row.
If A was a Numpy array then to get the columns of corresponding rows mentioned in index we can do
x = A[np.arange(A.shape[0]), index]
and the result would be
[1, 4]
What is the TensorFlow equivalent operation/operations for this? I know TensorFlow doesn't support many indexing operations. What would be the work around if it cannot be done directly?
You can extend your column indices with row indices and then use gather_nd:
import tensorflow as tf
A = tf.constant([[1, 2], [3, 4]])
indices = tf.constant([1, 0])
# prepare row indices
row_indices = tf.range(tf.shape(indices)[0])
# zip row indices with column indices
full_indices = tf.stack([row_indices, indices], axis=1)
# retrieve values by indices
S = tf.gather_nd(A, full_indices)
session = tf.InteractiveSession()
session.run(S)
You can use tf.one_hot to create a one-hot boolean mask and then use it with tf.boolean_mask to select the elements you'd like.
A = tf.Variable([[1, 2], [3, 4]])
index = tf.Variable([0, 1])
one_hot_mask = tf.one_hot(index, A.shape[1], on_value = True, off_value = False, dtype = tf.bool)
output = tf.boolean_mask(A, one_hot_mask)
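Both approaches above can be cross-checked against the NumPy expression from the question. The sketch below mirrors them in NumPy (stacked row/column indices for gather_nd, a one-hot boolean mask for boolean_mask); it is an illustration of the indexing logic only, not TensorFlow code.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
index = np.array([0, 1])

# Reference result from the question
expected = A[np.arange(A.shape[0]), index]

# gather_nd style: pair every column index with its row index
full_indices = np.stack([np.arange(index.shape[0]), index], axis=1)
gathered = A[full_indices[:, 0], full_indices[:, 1]]

# one-hot style: exactly one True per row, at the requested column
one_hot_mask = np.arange(A.shape[1])[None, :] == index[:, None]
masked = A[one_hot_mask]

print(expected, gathered, masked)
```

The one-hot variant relies on there being exactly one selected element per row, so the flattened result stays aligned with the rows.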
After dabbling around for quite a while, I found two functions that could be useful.
One is tf.gather_nd(), which is useful if you can produce a tensor of the form [[0, 0], [1, 1]], so that you can do
index = tf.constant([[0, 0], [1, 1]])
tf.gather_nd(A, index)
If for some reason you are unable to produce a tensor of the form [[0, 0], [1, 1]] (I couldn't, as the number of rows in my case depended on a placeholder), then the workaround I found is to use tf.py_func(). Here is example code showing how this can be done:
import tensorflow as tf
import numpy as np
def index_along_every_row(array, index):
    N, _ = array.shape
    return array[np.arange(N), index]
a = tf.Variable([[1, 2], [3, 4]], dtype=tf.int32)
index = tf.Variable([0, 1], dtype=tf.int32)
a_slice_op = tf.py_func(index_along_every_row, [a, index], [tf.int32])[0]
session = tf.InteractiveSession()
a.initializer.run()
index.initializer.run()
a_slice = a_slice_op.eval()
a_slice will be a numpy array [1, 4]
We can do the same using this combination of map_fn and gather_nd.
def get_element(a, indices):
    """
    Outputs (ith element of indices) from (ith row of a)
    """
    return tf.map_fn(lambda x: tf.gather_nd(x[0], x[1]),
                     (a, indices),
                     dtype=tf.float32)
Here's an example usage.
A = tf.constant(np.array([[1, 2, 3],
                          [4, 5, 6],
                          [7, 8, 9]], dtype=np.float32))
idx = tf.constant(np.array([[2], [1], [0]]))
elems = get_element(A, idx)
with tf.Session() as sess:
    e = sess.run(elems)
    print(e)
I don't know if this will be much slower than other answers.
It has the advantage that you don't need to specify the number of rows of A in advance, as long as a and indices have the same number of rows at runtime.
Note that the output of the above will be rank 1. If you'd prefer it to have rank 2, replace gather_nd with gather.
I couldn't get the accepted answer to work in Tensorflow 2 when I incorporated it into a loss function. Something about GradientTape didn't like it. My solution is an altered version of the accepted answer:
def get_rows(arr):
    N, _ = arr.shape
    return N

num_rows = tf.py_function(get_rows, [arr], [tf.int32])[0]
rng = tf.range(0, num_rows)
ind = tf.stack([rng, ind], axis=1)
tf.gather_nd(arr, ind)
