Output of LawformerModel - python

I am using the Lawformer model for some tasks, but I am confused by the output. Could you explain what the output tensor represents? Also, when my input is a list of several strings, the second dimension of the output is padded to the largest number of tokens. Why is this different from SentenceTransformer, whose output is a fixed-length tensor?
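A minimal sketch of the shapes involved, assuming the model is the Hugging Face checkpoint thunlp/Lawformer (substitute whatever loading code and tokenizer you actually use): last_hidden_state holds one vector per token, padded to the longest text in the batch, and mean-pooling over the attention mask gives a fixed-length vector per text, which is roughly what SentenceTransformer does internally.
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: model and tokenizer are loaded from the "thunlp/Lawformer" checkpoint.
tokenizer = AutoTokenizer.from_pretrained("thunlp/Lawformer")
model = AutoModel.from_pretrained("thunlp/Lawformer")

texts = ["first legal document ...", "a much longer second legal document ..."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One embedding per (padded) token: shape [batch, longest_num_tokens, hidden].
token_embeddings = outputs.last_hidden_state

# Mean-pool over real (non-padding) tokens to get one fixed-length vector per text.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # [batch, hidden]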

Related

Keras Sequence of sequences

Let's say I have an input to the model that is a sequence of items: [3,2,1,4,5,6]. Then I can make a Keras Input layer with shape=(SEQUENCE_LEN, ) and add an Embedding layer. But my input is a sequence of sequences of pairs. For example: [[(3, 1),(2,3),(1,5)], [(4, 1),(5,2)], [(6, 5),(7, 1),(8, 5)]]
I have pre-trained embeddings for the first item of each pair, and I want a trainable Embedding layer for the second items. The second item of each pair roughly means the amount of the first item in some context. Which first layers would you suggest to turn this into a usual sequence-classification problem? I suspect I need more than just a combination of Embedding and Input layers.
UPD: Even if I just concatenate these sequences into one, how do I cope with the second value in each pair? What is the best way to use the information about the "amount" of items?
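One possible starting point (a sketch under my own assumptions, not a canonical answer): flatten each nested sequence, keep item ids and amounts as two parallel padded integer sequences, give each its own Embedding (the item one initialized from the pre-trained vectors and frozen, the amount one trainable), and concatenate along the feature axis before the recurrent layer. All sizes below are placeholders.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN = 20         # padded length of the flattened sequence (placeholder)
VOCAB_ITEMS = 1000   # number of distinct first items (placeholder)
VOCAB_AMOUNTS = 50   # number of distinct amount values (placeholder)
EMB_DIM = 100        # dimension of the pre-trained item embeddings (placeholder)
pretrained = np.random.rand(VOCAB_ITEMS, EMB_DIM)  # stand-in for the real vectors

items_in = keras.Input(shape=(MAX_LEN,), name="items")     # first items of each pair
amounts_in = keras.Input(shape=(MAX_LEN,), name="amounts")  # second items ("amounts")

item_emb = layers.Embedding(VOCAB_ITEMS, EMB_DIM,
                            weights=[pretrained], trainable=False)(items_in)
amount_emb = layers.Embedding(VOCAB_AMOUNTS, 8, trainable=True)(amounts_in)

x = layers.Concatenate(axis=-1)([item_emb, amount_emb])  # (MAX_LEN, EMB_DIM + 8)
x = layers.LSTM(64)(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model([items_in, amounts_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")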

Mask certain indices for every entry in a batch, when using torch.max()

I am incrementally sampling a batch of shape torch.Size([n, 8]).
I also have a list valid_indices of length n which contains tuples of indices that are valid for each entry in the batch.
For instance, valid_indices[0] may look like this: (0,1,3,4,5,7), which means that indices 2 and 6 should be excluded from the first entry in the batch along dim 1.
In particular, I need to exclude these values when I use torch.max(batch, dim=1, keepdim=True).
Indices to be excluded (if any) may differ from entry to entry within the batch.
Any ideas? Thanks in advance.
I assume that you are getting the good old
IndexError: too many indices for tensor of dimension 1
error when you use your tuple indices directly on the tensor.
At least that was the error I was able to reproduce when I executed the following line
t[0][valid_idx0]
Where t is a random tensor with size (10,8) and valid_idx0 is a tuple with 4 elements.
However, the same line works just fine when you convert your tuple to a list, as follows:
t[0][list(valid_idx0)]
>>> tensor([0.1847, 0.1028, 0.7130, 0.5093])
But when it comes to applying these indices to 2D tensors, things get a bit different, since we need to preserve the structure of our tensor for batch processing.
Therefore, it would be reasonable to convert our indices to mask arrays.
Let's say we have a list of tuples valid_indices at hand. The first step is to convert it to a list of lists.
valid_idx_list = [list(tup) for tup in valid_indices]
The second step is to convert these index lists into a mask array.
import numpy as np

masks = np.zeros(t.size())
for i, indices in enumerate(valid_idx_list):
    masks[i][indices] = 1
Done. Now we can apply our mask and use the torch.max on the masked tensor.
torch.max(t*masks)
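Note that multiplying by a 0/1 mask assumes all entries are non-negative. If negative values can occur, an alternative sketch (my own, not part of the original answer) is to masked_fill the excluded positions with -inf so they can never win the max, while keeping the dim=1, keepdim=True call from the question:
import torch

t = torch.randn(3, 8)
valid_indices = [(0, 1, 3, 4, 5, 7), (2, 3), (0, 6, 7)]  # example tuples

# Boolean mask: True where the index is valid for that row.
mask = torch.zeros_like(t, dtype=torch.bool)
for i, idx in enumerate(valid_indices):
    mask[i, list(idx)] = True

# Excluded positions become -inf, so the row-wise max ignores them.
masked = t.masked_fill(~mask, float("-inf"))
values, indices = torch.max(masked, dim=1, keepdim=True)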
Kindly see the colab notebook that I've used to reproduce the problem.
https://colab.research.google.com/drive/1BhKKgxk3gRwUjM8ilmiqgFvo0sfXMGiK?usp=sharing

How to transform a 2D tensor and an index tensor for torch.nn.utils.rnn.pack_sequence

I have a sequences collection in the following form:
sequences = torch.tensor([[2,1],[5,6],[3,0]])
indexes = torch.tensor([1,0,1])
that is, sequence 0 is made of just [5,6], and sequence 1 is made of [2,1] and [3,0]. Mathematically, sequence[i] = { sequences[j] such that i = indexes[j] }
I need to feed these sequences into an LSTM. Since these are variable-length sequences, pytorch documentation states to use something like torch.nn.utils.rnn.pack_sequence.
Sadly, this method and its relatives want, as input, a list of tensors, each of shape L x *, with L being the length of the single sequence.
How can I build something that can be fed into a PyTorch LSTM?
P.S. Throughout the code I work with these tensors using the scatter and gather functionality, but I can't find a way to use them to achieve this goal.
First of all, you need to separate your sequences. pack_sequence accepts a list of tensors, each tensor of shape L x *. The other dimensions must always be the same for all sequences, but L, the sequence length, can vary. For example, your sequences 0 and 1 can be packed as:
sequences = [torch.tensor([[5,6]]), torch.tensor([[2,1],[3,0]])]
packed_seq = torch.nn.utils.rnn.pack_sequence(sequences, enforce_sorted=False)
Here, in sequences, sequences[0] is of shape (1,2) while sequences[1] is of shape (2,2). The first dimension represents their length, which is 1 and 2 respectively.
You can separate the sequences by:
sequences = torch.tensor([[2,1],[5,6],[3,0]])
indexes = torch.tensor([1,0,1])
num_seq = np.unique(indexes)
sequences = [sequences[indexes==seq_id] for seq_id in num_seq]
This creates sequences=[torch.tensor([[5,6]]), torch.tensor([[2,1],[3,0]])].
I found an alternative and more efficient way to separate the sequences:
sequences = torch.tensor([[2,1],[5,6],[3,0]])
indexes = torch.tensor([1,0,1])
sorted_src = sequences[indexes.argsort()]
indexes_count = torch.unique(indexes, return_counts=True)[1]
splitted = torch.split(sorted_src, indexes_count.tolist(), dim=0)
This method is almost 3 times faster than the one proposed above (by @Mercury).
Measured using the timeit module, with sequences being a (5000,256) tensor and indexes being a (1500,) tensor.
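For completeness, a sketch of the whole pipeline (my own illustration; the hidden size and dtype below are arbitrary): split as above, pack, feed into an nn.LSTM, and unpack if a padded output is needed.
import torch
import torch.nn as nn

sequences = torch.tensor([[2., 1.], [5., 6.], [3., 0.]])
indexes = torch.tensor([1, 0, 1])

# Group rows by sequence id, then split into variable-length sequences.
sorted_seq = sequences[indexes.argsort()]
counts = torch.unique(indexes, return_counts=True)[1]
split_seqs = torch.split(sorted_seq, counts.tolist(), dim=0)

packed = nn.utils.rnn.pack_sequence(list(split_seqs), enforce_sorted=False)
lstm = nn.LSTM(input_size=2, hidden_size=4, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Back to a padded (num_sequences, max_len, hidden_size) tensor if needed.
out, lengths = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)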

string concatenation in tensorflow

I have a tf.string tensor, chars, with shape chars[Batch][None], where None denotes a dynamically shaped dimension (the output of a variable-length sequence).
If this tensor's shape were known (e.g. chars[Batch][Time]), then I could achieve concatenation of strings along the last dimension as:
chars = tf.split(chars,chars.shape[-1],axis=-1)
words = tf.squeeze(tf.strings.join(chars))
However, since the shape is unknown until runtime, I cannot use split.
Is there another way to accomplish this for a dynamic shaped string tensor?
In other words, I would like the string analogy of
words = tf.reduce_sum(chars,axis=-1)
along a dynamic shaped dimension.
Update 23/07/2022: you can now use tf.strings.reduce_join to join all strings into a single string, or to join along an axis:
words = tf.strings.reduce_join(chars, axis=-1)
This can be accomplished via:
words = tf.reduce_join(chars,axis=-1)
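A small self-contained check of the updated API (my own example, with "" used as padding so it contributes nothing to the join):
import tensorflow as tf

# Padded [batch, time] string tensor; the empty-string padding disappears in the join.
chars = tf.constant([["c", "a", "t", ""], ["h", "i", "", ""]])
words = tf.strings.reduce_join(chars, axis=-1)
print(words)  # tf.Tensor([b'cat' b'hi'], shape=(2,), dtype=string)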

CNTK RuntimeError: AddSequence: Sequences must be a least one frame long

I am getting error in following code:
x = cntk.input_variable(shape=(8,3,1))
y = cntk.sequence.slice(x,1,0)
x0 = np.reshape(np.arange(48.0,dtype=np.float32),(2,8,1,3))
y.eval({x:x0})
Error : Sequences must be a least one frame long
But when I run this it runs fine :
x = cntk.input_variable(shape=(3,2)) #change
y = cntk.sequence.slice(x,1,0)
x0 = np.reshape(np.arange(24.0,dtype=np.float32),(1,8,1,3)) #change
y.eval({x:x0})
There are a few things I am not able to understand about the slice method:
At what array level is it going to slice?
According to the documentation, the second argument is begin_index and the one after it is end_index. How can begin_index be greater than end_index?
There are two versions of slice(), one for slicing tensors, and one for slicing sequences. Your example uses the one for sequences.
If your inputs are sequences (e.g. words), the first form, cntk.slice(), would individually slice every element of a sequence and create a sequence of the same length that consists of sliced tensors. The second form, cntk.sequence.slice(), will slice out a range of entries from the sequence. E.g. cntk.sequence.slice(x, 13, 42) will cut out sequence items 13..41 from x, and create a new sequence of length (42-13).
If you intended to experiment with the first form, please change to cntk.slice(). If you meant the sequence version, please try to enclose x0 in an additional [...]. The canonical form of passing minibatch data is as a list of batch entries (e.g. MB size of 128 --> a list with 128 entries), where each batch entry is a tensor of shape (Ti,) + input_shape where Ti is the sequence length of the respective sequence. This
x0 = [ np.reshape(np.arange(48.0,dtype=np.float32),(2,8,1,3)) ]
would denote a minibatch with a single entry (1 list entry), where the entry is a sequence of 2 sequence items, where each sequence item has shape (8,1,3).
The begin and end indices can be negative, in order to index from the end (similar to Python slices). Unlike Python however, 0 is a valid end index that refers to the end.
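Putting the pieces together, a sketch of the corrected call (my own reconstruction, not part of the original answer: it assumes a sequence-typed input via cntk.sequence.input_variable and an input shape of (8,1,3) so that it matches the per-item shape of the data):
import numpy as np
import cntk

# Assumption: the input is declared as a sequence whose items have shape (8,1,3).
x = cntk.sequence.input_variable(shape=(8, 1, 3))
y = cntk.sequence.slice(x, 1, 0)  # keep items 1..end, i.e. drop the first sequence item

# One list entry per minibatch sequence; this sequence has 2 items of shape (8,1,3).
x0 = [np.reshape(np.arange(48.0, dtype=np.float32), (2, 8, 1, 3))]
print(y.eval({x: x0}))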
