Tensorflow - Retrieve each character in a string tensor

Tensorflow - Retrieve each character in a string tensor - python

I'm trying to retrieve the characters in a string tensor for character level prediction. The ground truths are words where each character has an id in dictionary. I have a tensor corresponding to the length of the string.
Now, I have to get each character in the string tensor. After checking the related posts, a simple retrieval can be as follows. Example string is "This"
a= tf.constant("This",shape=[1])
b=tf.string_split(a,delimiter="").values #Sparse tensor has the values array which stores characters
Now I want to make a string with spaces in between the letters "This" i.e " T h i s ". I need spacing at the start and the end too.
How do I do this?
I have tried to iterate through the characters like below
for i in xrange(b.dense_shape[1]): # b.dense_shape[1] has the length of string
x=b.values[i]
But the loop expects an integer rather than a tensor.
Any idea on how to do the above tasks? I couldn't find any documentation related to this (apart from the tf.string_split function). Any suggestions are welcome. Thanks

Your problem is that you are trying to iterate over Tensor, that is not iterable. There is some alternatives for this task, such as convert it to numpy array with eval() or use the tf.map_fn.
If you want to threat b as numpy array you only need to add the call .eval() before .values and iterate over the result as follows:
with tf.Session() as sess:
a = tf.constant("This", shape=[1])
b = tf.string_split(a, delimiter="").values.eval()
for i in b:
print(i)
The second alternative is more appropriate because of it takes advantage of TensorFlow's graph. It is based in the use of a function that "maps" the Tensor. This can be done as follows (where in fn you can define de behavior of the iteration):
with tf.Session() as sess:
a = tf.constant("This", shape=[1])
b = tf.string_split(a, delimiter="").values
fn = lambda i: i
print(tf.map_fn(fn, b).eval())

Related

Mask certain indices for every entry in a batch, when using torch.max()

I am incremently sampling a batch of size torch.Size([n, 8]).
I also have a list valid_indices of length n which contains tuples of indices that are valid for each entry in the batch.
For instance valid_indices[0] may look like this: (0,1,3,4,5,7) , which suggests that indices 2 and 6 should be excluded from the first entry in batch along dim 1.
Particularly I need to exclude these values for when I use torch.max(batch, dim=1, keepdim=True).
Indices to be excluded (if any) may differ from entry to entry within the batch.
Any ideas? Thanks in advance.

I assume that you are getting the good old
IndexError: too many indices for tensor of dimension 1
error when you use your tuple indices directly on the tensor.
At least that was the error that I was able to reproduce when I execute the following line
t[0][valid_idx0]
Where t is a random tensor with size (10,8) and valid_idx0 is a tuple with 4 elements.
However, same line works just fine when you convert your tuple to a list as following
t[0][list(valid_idx0)]
>>> tensor([0.1847, 0.1028, 0.7130, 0.5093])
But when it comes to applying these indices to 2D tensors, things get a bit different, since we need to preserve the structure of our tensor for batch processing.
Therefore, it would be reasonable to convert our indices to mask arrays.
Let's say we have a list of tuples valid_indices at hand. First thing will be converting it to a list of lists.
valid_idx_list = [list(tup) for tup in valid_indices]
Second thing will be converting them to mask arrays.
masks = np.zeros((t.size()))
for i, indices in enumerate(valid_idx_list):
masks[i][indices] = 1
Done. Now we can apply our mask and use the torch.max on the masked tensor.
torch.max(t*masks)
Kindly see the colab notebook that I've used to reproduce the problem.
https://colab.research.google.com/drive/1BhKKgxk3gRwUjM8ilmiqgFvo0sfXMGiK?usp=sharing

How to transform a 2D and index tensors for torch.nn.utils.rnn.pack_sequence

I have a sequences collection in the following form:
sequences = torch.tensor([[2,1],[5,6],[3,0])
indexes = torch.tensor([1,0,1])
that is, the sequence 0 is made of just [5,6], and the sequence 1 is made of [2,1] , [3,0]. Mathematically sequence[i] = { sequences[j] such that i = indexes[j] }
I need to feed these sequences into an LSTM. Since these are variable-length sequences, pytorch documentation states to use something like torch.nn.utils.rnn.pack_sequence.
Sadly, this method and its like want, as input, a list of tensors where each of them is a L x *, with L being the length of the single sequence.
How can build something that can be fed into a pytorch LSTM?
P.s. throughout the code I work with these tensors using scatter and gather functionalities but I can't find a way to use them to achieve this goal.

First of all, you need to separate your sequences. Pack_sequence accepts a list of tensors, each tensor being the shape L x *. The other dimensions must always be the same for all sequences, but L, or the sequence length can be varying. For example, your sequence 0 and 1 can be packed as:
sequences = [torch.tensor([[5,6]]), torch.tensor([[2,1],[3,0]])]
packed_seq = torch.nn.utils.rnn.pack_sequence(sequences, enforce_sorted=False)
Here, in sequences, sequences[0] is of shape (1,2) while sequences[1] is of shape (2,2). The first dimension represents their length, which is 1 and 2 respectively.
You can separate the sequences by:
sequences = torch.tensor([[2,1],[5,6],[3,0]])
indexes = torch.tensor([1,0,1])
num_seq = np.unique(indexes)
sequences = [sequences[indexes==seq_id] for seq_id in num_seq]
This creates sequences=[torch.tensor([[5,6]]), torch.tensor([[2,1],[3,0]])].

I found an alternative and more efficient way to separate the sequences:
sequences = torch.tensor([[2,1],[5,6],[3,0]])
indexes = torch.tensor([1,0,1])
sorted_src = src[indexes.argsort()]
indexes_count = torch.unique(indexes, return_counts=True)[1]
splitted = torch.split(sorted_src, indexes_count.tolist(), dim=0)
This method is almost 3 times faster then the one proposed by #Mercury.
Measured using timeit module with sequences being a (5000,256) tensor and indexes being (1500)

Iterating through tensors in pytorch

I have two 1D tensors. One is a vector of predictions, second is vector of labels. I am trying to write a loop that checks the element-wise difference between vectors. If such diff is spotted, I want to do another operation, for simplicity let's say I want to print ("Diff spotted"). So far I came up with this but I got an error: Expected object of scalar type Byte but got scalar type Float for argument #2 'other'. I would appreciate help here. Maybe there is some more efficient way to do it, without loop.
for i in enumerate(t1):
if t1[i] != t2[i]:
print("Diff spotted")

You can use the eq() function in pytorch to check if to tensors are the same element-wise. For every index of an element which is the same as the labels element you get a True:
for label in predictions.round().eq(labels):
for element in label:
if element == False:
print("Diff spotted!")

string concatenation in tensorflow

I have a tf.string tensor, chars, with shape chars[Batch][None] where None denotes a dynamic shaped tensor (output from a variable length sequence).
If this tensor's shape were known (e.g. chars[Batch][Time]), then I could achieve concatenation of strings along the last dimension as:
chars = tf.split(chars,chars.shape[-1],axis=-1)
words = tf.squeeze(tf.strings.join(chars))
However, since the shape is unknown until runtime, I cannot use split.
Is there another way to accomplish this for a dynamic shaped string tensor?
In other words, I would like the string analogy of
words = tf.reduce_sum(chars,axis=-1)
along a dynamic shaped dimension.

Update 23/07/2022: Now you can use tf.strings.reduce_join to join all strings into a single string, or joins along an axis
words = tf.strings.reduce_join(chars, axis=-1)
This can be accomplished via:
words = tf.reduce_join(chars,axis=-1)

Working with variable-length text in Tensorflow

I am building a Tensorflow model to perform inference on text phrases.
For sake of simplicity, assume I need a classifier with fixed number of output classes but a variable-length text in input. In other words, my mini batch would be a sequence of phrases but not all phrases have the same length.
data = ['hello',
'my name is Mark',
'What is your name?']
My first preprocessing step was to build a dictionary of all possible words in the dictionary and map each word to its integer word-Id. The input becomes:
data = [[1],
[2, 3, 4, 5],
[6, 4, 7, 3]
What's the best way to handle this kind of input? Can tf.placeholder() handle variable-size input within the same batch of data?
Or should I pad all strings such that they all have the same length, equal to the length of the longest string, using some placeholder for the missing words? This seems to be very memory inefficient if some string are much longer that most of the others.
-- EDIT --
Here is a concrete example.
When I know the size of my datapoints (and all the datapoint have the same length, eg. 3) I normally use something like:
input = tf.placeholder(tf.int32, shape=(None, 3)
with tf.Session() as sess:
print(sess.run([...], feed_dict={input:[[1, 2, 3], [1, 2, 3]]}))
where the first dimension of the placeholder is the minibatch size.
What if the input sequences are words in sentences of different length?
feed_dict={input:[[1, 2, 3], [1]]}

The other two answers are correct, but low on details. I was just looking at how to do this myself.
There is machinery in TensorFlow to to all of this (for some parts it may be overkill).
Starting from a string tensor (shape [3]):
import tensorflow as tf
lines = tf.constant([
'Hello',
'my name is also Mark',
'Are there any other Marks here ?'])
vocabulary = ['Hello', 'my', 'name', 'is', 'also', 'Mark', 'Are', 'there', 'any', 'other', 'Marks', 'here', '?']
The first thing to do is split this into words (note the space before the question mark.)
words = tf.string_split(lines," ")
Words will now be a sparse tensor (shape [3,7]). Where the two dimensions of the indices are [line number, position]. This is represented as:
indices values
0 0 'hello'
1 0 'my'
1 1 'name'
1 2 'is'
...
Now you can do a word lookup:
table = tf.contrib.lookup.index_table_from_tensor(vocabulary)
word_indices = table.lookup(words)
This returns a sparse tensor with the words replaced by their vocabulary indices.
Now you can read out the sequence lengths by looking at the maximum position on each line :
line_number = word_indices.indices[:,0]
line_position = word_indices.indices[:,1]
lengths = tf.segment_max(data = line_position,
segment_ids = line_number)+1
So if you're processing variable length sequences it's probably to put in an lstm ... so let's use a word-embedding for the input (it requires a dense input):
EMBEDDING_DIM = 100
dense_word_indices = tf.sparse_tensor_to_dense(word_indices)
e_layer = tf.contrib.keras.layers.Embedding(len(vocabulary), EMBEDDING_DIM)
embedded = e_layer(dense_word_indices)
Now embedded will have a shape of [3,7,100], [lines, words, embedding_dim].
Then a simple lstm can be built:
LSTM_SIZE = 50
lstm = tf.nn.rnn_cell.BasicLSTMCell(LSTM_SIZE)
And run the across the sequence, handling the padding.
outputs, final_state = tf.nn.dynamic_rnn(
cell=lstm,
inputs=embedded,
sequence_length=lengths,
dtype=tf.float32)
Now outputs has a shape of [3,7,50], or [line,word,lstm_size]. If you want to grab the state at the last word of each line you can use the (hidden! undocumented!) select_last_activations function:
from tensorflow.contrib.learn.python.learn.estimators.rnn_common import select_last_activations
final_output = select_last_activations(outputs,tf.cast(lengths,tf.int32))
That does all the index shuffling to select the output from the last timestep. This gives a size of [3,50] or [line, lstm_size]
init_t = tf.tables_initializer()
init = tf.global_variables_initializer()
with tf.Session() as sess:
init_t.run()
init.run()
print(final_output.eval().shape())
I haven't worked out the details yet but I think this could probably all be replaced by a single tf.contrib.learn.DynamicRnnEstimator.

How about this? (I didn’t implement this. but maybe this idea will work.)
This method is based on BOW representation.
Get your data as tf.string
Split it using tf.string_split
Find indexes of your words using tf.contrib.lookup.string_to_index_table_from_file or tf.contrib.lookup.string_to_index_table_from_tensor. Length of this tensor can vary.
Find embeddings of your indexes.
word_embeddings = tf.get_variable(“word_embeddings”,
[vocabulary_size, embedding_size])
embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)`
Sum up the embeddings. And you will get a tensor of fixed length(=embedding size). Maybe you can choose another method then sum.(avg, mean or something else)
Maybe it’s too late :) Good luck.

I was building a sequence to sequence translator the other day. What I did is decided to do was make it for a fixed length of 32 words (which was a bit above the average sentence length) although you can make it as long as you want. I then added a NULL word to the dictionary and padded all my sentence vectors with it. That way I could tell the model where the end of my sequence was and the model would just output NULL at the end of its output. For instance take the expression "Hi what is your name?" This would become "Hi what is your name? NULL NULL NULL NULL ... NULL". It worked pretty well but your loss and accuracy during training will appear a bit higher than it actually is since the model usually gets the NULLs right which count towards the cost.
There is another approach called masking. This too allows you to build a model for a fixed length sequence but only evaluate the cost up to the end of a shorter sequence. You could search for the first instance of NULL in the output sequence (or expected output, whichever is greater) and only evaluate the cost up to that point. Also I think some tensor flow functions like tf.dynamic_rnn support masking which may be more memory efficient. I am not sure since I have only tried the first approach of padding.
Finally, I think in the tensorflow example of Seq2Seq model they use buckets for different sized sequences. This would probably solve your memory issue. I think you could share the variables between the different sized models.

So here is what I did (not sure if its 100% the right way to be honest):
In your vocab dict where each key is a number pointing to one particular word, add another key say K which points to "<PAD>"(or any other representation you want to use for padding)
Now your placeholder for input would look something like this:
x_batch = tf.placeholder(tf.int32, shape=(batch_size, None))
where None represents the largest phrase/sentence/record in your mini batch.
Another small trick I used was to store the length of each phrase in my mini batch. For example:
If my input was: x_batch = [[1], [1,2,3], [4,5]]
then I store: len_batch = [1, 3, 2]
Later I use this len_batch and the max size of a phrase(l_max) in my minibatch to create a binary mask. Now l_max=3 from above, so my mask would look something like this:
mask = [
[1, 0, 0],
[1, 1, 1],
[1, 1, 0]
]
Now if you multiply this with your loss you would basically eliminate all loss introduced as a result of padding.
Hope this helps.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tensorflow - Retrieve each character in a string tensor - python

Related

Mask certain indices for every entry in a batch, when using torch.max()

How to transform a 2D and index tensors for torch.nn.utils.rnn.pack_sequence

Iterating through tensors in pytorch

string concatenation in tensorflow

Working with variable-length text in Tensorflow

Categories

Resources