string concatenation in tensorflow - python

I have a tf.string tensor, chars, with shape chars[Batch][None] where None denotes a dynamic shaped tensor (output from a variable length sequence).
If this tensor's shape were known (e.g. chars[Batch][Time]), then I could achieve concatenation of strings along the last dimension as:
chars = tf.split(chars, chars.shape[-1], axis=-1)
words = tf.squeeze(tf.strings.join(chars))
However, since the shape is unknown until runtime, I cannot use split.
Is there another way to accomplish this for a dynamic shaped string tensor?
In other words, I would like the string analogy of
words = tf.reduce_sum(chars, axis=-1)
along a dynamic shaped dimension.

Update 23/07/2022: You can now use tf.strings.reduce_join to join all strings into a single string, or to join along an axis:
words = tf.strings.reduce_join(chars, axis=-1)
In older versions of TensorFlow, this was accomplished via the since-deprecated tf.reduce_join:
words = tf.reduce_join(chars, axis=-1)
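For illustration, a minimal TF 2.x sketch (padding the variable-length batch with empty strings, which vanish under the default empty separator):
import tensorflow as tf

# A batch of two variable-length character sequences, padded with "".
chars = tf.constant([["f", "o", "o", ""],
                     ["b", "a", "r", "s"]])
words = tf.strings.reduce_join(chars, axis=-1)
print(words)  # tf.Tensor([b'foo' b'bars'], shape=(2,), dtype=string)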

Related

Output of LawformerModel

I am using the Lawformer model to do some tasks, but I am confused by the output. Could you explain what the output tensor represents? Also, when my input is a list consisting of several strings, the second dimension of the output is padded to the largest number of tokens. Why is this different from SentenceTransformer, whose output is a fixed-length tensor?

Mask certain indices for every entry in a batch, when using torch.max()

I am incrementally sampling a batch of size torch.Size([n, 8]).
I also have a list valid_indices of length n which contains tuples of indices that are valid for each entry in the batch.
For instance, valid_indices[0] may look like this: (0, 1, 3, 4, 5, 7), which means that indices 2 and 6 should be excluded from the first entry in the batch along dim 1.
In particular, I need to exclude these values when I use torch.max(batch, dim=1, keepdim=True).
Indices to be excluded (if any) may differ from entry to entry within the batch.
Any ideas? Thanks in advance.
I assume that you are getting the good old
IndexError: too many indices for tensor of dimension 1
error when you use your tuple indices directly on the tensor.
At least, that was the error I was able to reproduce when executing the following line:
t[0][valid_idx0]
where t is a random tensor of size (10, 8) and valid_idx0 is a tuple with 4 elements.
However, the same line works just fine when you convert your tuple to a list, as follows:
t[0][list(valid_idx0)]
>>> tensor([0.1847, 0.1028, 0.7130, 0.5093])
But when it comes to applying these indices to 2D tensors, things get a bit different, since we need to preserve the structure of our tensor for batch processing.
Therefore, it would be reasonable to convert our indices to mask arrays.
Let's say we have a list of tuples valid_indices at hand. First thing will be converting it to a list of lists.
valid_idx_list = [list(tup) for tup in valid_indices]
Second thing will be converting them to mask arrays.
masks = torch.zeros_like(t)
for i, indices in enumerate(valid_idx_list):
    masks[i][indices] = 1
Done. Now we can apply the mask and use torch.max on the masked tensor.
torch.max(t * masks, dim=1, keepdim=True)
Kindly see the colab notebook that I've used to reproduce the problem.
https://colab.research.google.com/drive/1BhKKgxk3gRwUjM8ilmiqgFvo0sfXMGiK?usp=sharing
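Note that multiplying by a 0/1 mask only excludes entries reliably when all values are non-negative: masked-out positions become 0, which could still beat negative values under max. A sketch of a more robust variant (using masked_fill with -inf, not from the notebook above):
import torch

n = 4
t = torch.randn(n, 8)
valid_indices = [(0, 1, 3, 4, 5, 7), (0, 2), (1, 5, 6), (0, 1, 2, 3)]

# Boolean mask that is True at positions to exclude from the max.
invalid = torch.ones(n, 8, dtype=torch.bool)
for i, idx in enumerate(valid_indices):
    invalid[i, list(idx)] = False

# -inf can never be the maximum, so excluded entries never win.
masked = t.masked_fill(invalid, float("-inf"))
values, argmax = torch.max(masked, dim=1, keepdim=True)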

How to transform a 2D and index tensors for torch.nn.utils.rnn.pack_sequence

I have a sequences collection in the following form:
sequences = torch.tensor([[2,1],[5,6],[3,0]])
indexes = torch.tensor([1,0,1])
that is, sequence 0 is made of just [5,6], and sequence 1 is made of [2,1], [3,0]. Mathematically, sequence[i] = { sequences[j] such that indexes[j] = i }.
I need to feed these sequences into an LSTM. Since these are variable-length sequences, the PyTorch documentation says to use something like torch.nn.utils.rnn.pack_sequence.
Sadly, this method and its kind want, as input, a list of tensors, each of shape L x *, with L being the length of the single sequence.
How can I build something that can be fed into a PyTorch LSTM?
P.S. Throughout the code I work with these tensors using scatter and gather functionality, but I can't find a way to use them to achieve this goal.
First of all, you need to separate your sequences. pack_sequence accepts a list of tensors, each tensor of shape L x *. The other dimensions must always be the same for all sequences, but L, the sequence length, can vary. For example, your sequences 0 and 1 can be packed as:
sequences = [torch.tensor([[5,6]]), torch.tensor([[2,1],[3,0]])]
packed_seq = torch.nn.utils.rnn.pack_sequence(sequences, enforce_sorted=False)
Here, in sequences, sequences[0] is of shape (1,2) while sequences[1] is of shape (2,2). The first dimension represents their length, which is 1 and 2 respectively.
You can separate the sequences by:
sequences = torch.tensor([[2,1],[5,6],[3,0]])
indexes = torch.tensor([1,0,1])
num_seq = torch.unique(indexes)
sequences = [sequences[indexes == seq_id] for seq_id in num_seq]
This creates sequences=[torch.tensor([[5,6]]), torch.tensor([[2,1],[3,0]])].
I found an alternative and more efficient way to separate the sequences:
sequences = torch.tensor([[2,1],[5,6],[3,0]])
indexes = torch.tensor([1,0,1])
sorted_src = sequences[indexes.argsort()]
indexes_count = torch.unique(indexes, return_counts=True)[1]
splitted = torch.split(sorted_src, indexes_count.tolist(), dim=0)
This method is almost 3 times faster than the one proposed by @Mercury, measured using the timeit module with sequences being a (5000, 256) tensor and indexes being (1500).
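Putting the two pieces together, a minimal end-to-end sketch (using the variable names from the snippets above):
import torch

sequences = torch.tensor([[2, 1], [5, 6], [3, 0]], dtype=torch.float32)
indexes = torch.tensor([1, 0, 1])

# Group rows by sequence id, then split into one tensor per sequence.
sorted_src = sequences[indexes.argsort()]
counts = torch.unique(indexes, return_counts=True)[1]
split_seqs = torch.split(sorted_src, counts.tolist(), dim=0)

# pack_sequence wants a list of (L_i, *) tensors; the L_i may differ.
packed = torch.nn.utils.rnn.pack_sequence(list(split_seqs), enforce_sorted=False)
lstm = torch.nn.LSTM(input_size=2, hidden_size=4, batch_first=True)
output, (h, c) = lstm(packed)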

Tensorflow - Retrieve each character in a string tensor

I'm trying to retrieve the characters in a string tensor for character-level prediction. The ground truths are words where each character has an id in a dictionary. I have a tensor corresponding to the length of the string.
Now, I have to get each character in the string tensor. After checking the related posts, a simple retrieval can be done as follows. The example string is "This":
a = tf.constant("This", shape=[1])
b = tf.string_split(a, delimiter="").values  # the sparse tensor's values array stores the characters
Now I want to make a string with spaces between the letters of "This", i.e. " T h i s ". I need spacing at the start and the end too.
How do I do this?
I have tried to iterate through the characters like below
for i in xrange(b.dense_shape[1]):  # b.dense_shape[1] has the length of the string
    x = b.values[i]
But the loop expects an integer rather than a tensor.
Any idea on how to do the above tasks? I couldn't find any documentation related to this (apart from the tf.string_split function). Any suggestions are welcome. Thanks
Your problem is that you are trying to iterate over a Tensor, which is not iterable. There are some alternatives for this task, such as converting it to a numpy array with eval(), or using tf.map_fn.
If you want to treat b as a numpy array, you only need to add a call to .eval() after .values and iterate over the result as follows:
with tf.Session() as sess:
    a = tf.constant("This", shape=[1])
    b = tf.string_split(a, delimiter="").values.eval()
    for i in b:
        print(i)
The second alternative is more appropriate because it takes advantage of TensorFlow's graph. It is based on the use of a function that "maps" the Tensor. This can be done as follows (where fn defines the behavior of each iteration):
with tf.Session() as sess:
    a = tf.constant("This", shape=[1])
    b = tf.string_split(a, delimiter="").values
    fn = lambda i: i
    print(tf.map_fn(fn, b).eval())
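As for the spacing part of the original question, the split characters can be re-joined with a space separator and then padded; a minimal sketch using the same TF 1.x API as the answers above:
import tensorflow as tf

a = tf.constant("This", shape=[1])
chars = tf.string_split(a, delimiter="").values   # ["T", "h", "i", "s"]
spaced = tf.reduce_join(chars, separator=" ")     # "T h i s"
padded = tf.string_join([" ", spaced, " "])       # " T h i s "

with tf.Session() as sess:
    print(sess.run(padded))  # b' T h i s '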

CNTK RuntimeError: AddSequence: Sequences must be a least one frame long

I am getting an error in the following code:
x = cntk.input_variable(shape=(8,3,1))
y = cntk.sequence.slice(x,1,0)
x0 = np.reshape(np.arange(48.0,dtype=np.float32),(2,8,1,3))
y.eval({x:x0})
Error: Sequences must be a least one frame long
But when I run this, it runs fine:
x = cntk.input_variable(shape=(3,2)) # change
y = cntk.sequence.slice(x,1,0)
x0 = np.reshape(np.arange(24.0,dtype=np.float32),(1,8,1,3)) #change
y.eval({x:x0})
I am not able to understand a few things about the slice method:
At what array level is it going to slice?
According to the documentation, the second argument is begin_index, followed by end_index. How can begin_index be greater than end_index?
There are two versions of slice(), one for slicing tensors, and one for slicing sequences. Your example uses the one for sequences.
If your inputs are sequences (e.g. words), the first form, cntk.slice(), would individually slice every element of a sequence and create a sequence of the same length that consists of sliced tensors. The second form, cntk.sequence.slice(), will slice out a range of entries from the sequence. E.g. cntk.sequence.slice(x, 13, 42) will cut out sequence items 13..41 from x, and create a new sequence of length (42-13).
If you intended to experiment with the first form, please change to cntk.slice(). If you meant the sequence version, please try to enclose x0 in an additional [...]. The canonical form of passing minibatch data is a list of batch entries (e.g. a minibatch size of 128 --> a list with 128 entries), where each batch entry is a tensor of shape (Ti,) + input_shape, with Ti being the sequence length of the respective sequence. This
x0 = [ np.reshape(np.arange(48.0,dtype=np.float32),(2,8,1,3)) ]
would denote a minibatch with a single entry (1 list entry), where the entry is a sequence of 2 sequence items, where each sequence item has shape (8,1,3).
The begin and end indices can be negative, in order to index from the end (similar to Python slices). Unlike Python however, 0 is a valid end index that refers to the end.
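For illustration, a small sketch (assuming the CNTK 2.x Python API) of how begin=-1, end=0 selects the last sequence item:
import numpy as np
import cntk

# A sequence of vectors of shape (1,); the sequence axis is dynamic.
x = cntk.sequence.input_variable(shape=(1,))
last = cntk.sequence.slice(x, -1, 0)  # begin=-1, end=0 -> the last item

# One minibatch entry: a sequence of 8 one-dimensional items.
x0 = [np.arange(8.0, dtype=np.float32).reshape(8, 1)]
print(last.eval({x: x0}))  # the last frame: 7.0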
