Type errors with BERT example - python

I'm new to the BERT QA model and was trying to follow the example found in this article. The problem is that when I run the code attached to the example, it produces the following type error: TypeError: argmax(): argument 'input' (position 1) must be Tensor, not str.
Here is the code that I've tried running:
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
#Model
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
#Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
question = '''SAMPLE QUESTION'''
paragraph = '''SAMPLE PARAGRAPH'''
encoding = tokenizer.encode_plus(text=question, text_pair=paragraph, add_special_tokens=True)
inputs = encoding['input_ids'] #Token embeddings
sentence_embedding = encoding['token_type_ids'] #Segment embeddings
tokens = tokenizer.convert_ids_to_tokens(inputs) #input tokens
start_scores, end_scores = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
answer = ' '.join(tokens[start_index:end_index+1])
The issue appears at the line where I call torch.argmax(start_scores) to get the index of the maximum element in start_scores: the error says it is not a tensor. When I tried printing this variable, it showed the string "start_logits". Does anyone know a solution to this issue?

After referring to the BERT documentation, we identified that the model output is an object containing multiple properties, not only the start and end scores. Thus, we applied the following changes to the code:
outputs = model(input_ids=torch.tensor([inputs]),token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)
answer = ' '.join(tokens[start_index:end_index+1])
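Alternatively, if you want to keep the original tuple-unpacking style, recent transformers versions accept return_dict=False in the forward call. A minimal sketch, assuming a transformers 4.x install (not part of the original fix):
# Ask the model for a plain (start_logits, end_logits) tuple instead of an output object.
start_scores, end_scores = model(input_ids=torch.tensor([inputs]),
                                 token_type_ids=torch.tensor([sentence_embedding]),
                                 return_dict=False)
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
answer = ' '.join(tokens[start_index:end_index+1])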
Always refer to the documentation first :D

Related

Understanding introductory example on transformers in Trax

My goal is to understand the introductory example on transformers in Trax, which can be found at https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html:
import trax
# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
model = trax.models.Transformer(
input_vocab_size=33300,
d_model=512, d_ff=2048,
n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
max_len=2048, mode='predict')
# Initialize using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
weights_only=True)
# Tokenize a sentence.
sentence = 'It is nice to learn new things today!'
tokenized = list(trax.data.tokenize(iter([sentence]), # Operates on streams.
vocab_dir='gs://trax-ml/vocabs/',
vocab_file='ende_32k.subword'))[0]
# Decode from the Transformer.
tokenized = tokenized[None, :] # Add batch dimension.
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
model, tokenized, temperature=0.0) # Higher temperature: more diverse results.
# De-tokenize,
tokenized_translation = tokenized_translation[0][:-1] # Remove batch and EOS.
translation = trax.data.detokenize(tokenized_translation,
vocab_dir='gs://trax-ml/vocabs/',
vocab_file='ende_32k.subword')
print(translation)
The example works fine. However, when I try to translate another example with the initialised model, e.g.
sentence = 'I would like to try another example.'
tokenized = list(trax.data.tokenize(iter([sentence]),
vocab_dir='gs://trax-ml/vocabs/',
vocab_file='ende_32k.subword'))[0]
tokenized = tokenized[None, :]
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
model, tokenized, temperature=0.0)
tokenized_translation = tokenized_translation[0][:-1]
translation = trax.data.detokenize(tokenized_translation,
vocab_dir='gs://trax-ml/vocabs/',
vocab_file='ende_32k.subword')
print(translation)
I get the output !, both on my local machine and on Google Colab. The same happens with other examples.
When I build and initialise a new model, everything works fine.
Is this a bug? If not, what is happening here and how can I avoid/fix that behaviour?
Tokenization and detokenization seem to work well; I debugged that. Things seem to go wrong in trax.supervised.decoding.autoregressive_sample.
I found it out myself: one needs to reset the model's state, presumably because in mode='predict' the model keeps the decoding cache from the previous call in its state. So the following code works for me:
def translate(model, sentence, vocab_dir, vocab_file):
empty_state = model.state # save empty state
tokenized_sentence = next(trax.data.tokenize(iter([sentence]), vocab_dir=vocab_dir,
vocab_file=vocab_file))
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
model, tokenized_sentence[None, :], temperature=0.0)[0][:-1]
translation = trax.data.detokenize(tokenized_translation, vocab_dir=vocab_dir,
vocab_file=vocab_file)
model.state = empty_state # reset state
return translation
# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
model = trax.models.Transformer(input_vocab_size=33300, d_model=512, d_ff=2048, n_heads=8,
n_encoder_layers=6, n_decoder_layers=6, max_len=2048,
mode='predict')
# Initialize using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
weights_only=True)
print(translate(model, 'It is nice to learn new things today!',
vocab_dir='gs://trax-ml/vocabs/', vocab_file='ende_32k.subword'))
print(translate(model, 'I would like to try another example.',
vocab_dir='gs://trax-ml/vocabs/', vocab_file='ende_32k.subword'))

How to slice string depending on length of tokens

When I use (with a long test_text and short question):
from transformers import BertTokenizer
import torch
from transformers import BertForQuestionAnswering
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
input_ids = tokenizer.encode(question, test_text)
print('Query has {:,} tokens.\n'.format(len(input_ids)))
sep_index = input_ids.index(tokenizer.sep_token_id)
num_seg_a = sep_index + 1
num_seg_b = len(input_ids) - num_seg_a
segment_ids = [0]*num_seg_a + [1]*num_seg_b
start_scores, end_scores = model(torch.tensor([input_ids]),
token_type_ids=torch.tensor([segment_ids]))
I get an error with the output
Token indices sequence length is longer than the specified maximum sequence length for this model (1244 > 512). Running this sequence through the model will result in indexing errors
Query has 1,244 tokens.
How can I split test_text into chunks of maximal length, knowing that no chunk may exceed 512 tokens? I would then ask the same question for each chunk of text, taking the best answer out of all of them, and also go through the text a second time with different slice points, in case the answer is cut by a slice boundary.
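For reference, here is a minimal sketch of the sliding-window idea described above (my own illustration, not from the original post). It assumes a recent transformers version where the model returns an output object with start_logits / end_logits, and the window and stride sizes are illustrative choices:
# Tokenize question and context separately so we can window the context.
question_ids = tokenizer.encode(question, add_special_tokens=False)
context_ids = tokenizer.encode(test_text, add_special_tokens=False)

max_chunk = 512 - len(question_ids) - 3   # leave room for [CLS] and two [SEP] tokens
stride = max_chunk // 2                   # overlap so an answer cut at a boundary survives in the next window

best_score, best_answer = float('-inf'), ''
for offset in range(0, len(context_ids), stride):
    chunk = context_ids[offset:offset + max_chunk]
    input_ids = ([tokenizer.cls_token_id] + question_ids + [tokenizer.sep_token_id]
                 + chunk + [tokenizer.sep_token_id])
    segment_ids = [0] * (len(question_ids) + 2) + [1] * (len(chunk) + 1)
    outputs = model(torch.tensor([input_ids]),
                    token_type_ids=torch.tensor([segment_ids]))
    start = torch.argmax(outputs.start_logits)
    end = torch.argmax(outputs.end_logits)
    score = (outputs.start_logits[0, start] + outputs.end_logits[0, end]).item()
    if score > best_score:
        tokens = tokenizer.convert_ids_to_tokens(input_ids[start:end + 1])
        best_score, best_answer = score, tokenizer.convert_tokens_to_string(tokens)
    if offset + max_chunk >= len(context_ids):  # last window reached
        break
print(best_answer)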

Multiple answer spans in context, BERT question answering

I am writing a question answering system using pre-trained BERT with a linear layer and a softmax layer on top. In the templates available on the net, the labels of one example usually consist of only one answer_start_index and one answer_end_index. For example, from Huggingface, when instantiating a SQUADFeatures object:
```
self.input_ids = input_ids
self.attention_mask = attention_mask
self.token_type_ids = token_type_ids
self.cls_index = cls_index
self.p_mask = p_mask
self.example_index = example_index
self.unique_id = unique_id
self.paragraph_len = paragraph_len
self.token_is_max_context = token_is_max_context
self.tokens = tokens
self.token_to_orig_map = token_to_orig_map
self.start_position = start_position
self.end_position = end_position
self.is_impossible = is_impossible
self.qas_id = qas_id
```
However, in my own dataset I have examples where the answer word is found at several locations in the context, i.e. there may be several correct spans constituting the answer.
My problem is that I don't know how to handle such examples. In the templates available on the net, labels are usually in a list, say:
[start_example1, start_example2, start_example3]
[end_example1, end_example2, end_example3]
In my case this may look like:
[start_example1, [start_example2_1, start_example2_2], start_example3]
and the same for the end indices, of course.
In other words, I do not have a list containing one label per example, but a list containing either a single label or a list of labels for an example, i.e. a list of lists.
When following other templates the next step in the process is:
```
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
token_type_ids = torch.cat(token_type_ids, dim=0)
span_starts = torch.tensor(span_starts) #Something like this
span_ends = torch.tensor(span_ends) #Something like this
```
However, this of course (?) raises an error, as my span_start and span_end lists do not contain only single items but sometimes a list within the list.
Does anyone have an idea how I can tackle this problem? Should I only use examples where there is a single span constituting the answer present in the context?
If I work around the torch error, will the backpropagation / evaluation / computation of loss still work?
Thank you! /B
Have you checked this code?
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text)
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
assert answer == "a nice puppet"
I am not sure if this is the best way, but instead of argmax you could use topk and check whether one of the top candidates corresponds to the correct answer.
t = torch.LongTensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
_, indices = t.topk(4)
print(indices)  # tensor([9, 8, 7, 6])
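Applied to the QA snippet above, a minimal sketch of that idea (my own illustration, not part of the original answer; scoring spans by the sum of start and end logits is an assumption):
# Enumerate top-k start/end candidates from the QA logits above.
# Assumes start_scores / end_scores are the (1, seq_len) tensors returned by the
# snippet above (older transformers API); with newer versions use
# outputs.start_logits / outputs.end_logits instead.
k = 5
start_vals, start_idx = start_scores[0].topk(k)
end_vals, end_idx = end_scores[0].topk(k)
candidates = []
for s_val, s in zip(start_vals, start_idx):
    for e_val, e in zip(end_vals, end_idx):
        if s <= e:  # keep only well-formed spans (start before end)
            candidates.append(((s_val + e_val).item(), s.item(), e.item()))
candidates.sort(reverse=True)  # best-scoring span first
for score, s, e in candidates[:k]:
    print(score, ' '.join(all_tokens[s:e + 1]))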

Implement a N-aryTreeLSTM version of the TreeLSTM in TensorFlow Fold

I'm trying to implement the TreeLSTM from this paper using TensorFlow Fold. TensorFlow Fold already contains an example of a TreeLSTM, but only a BinaryTreeLSTM version; here is the tutorial: https://github.com/tensorflow/fold/blob/master/tensorflow_fold/g3doc/sentiment.ipynb
What I'm trying to do now is to implement a real N-ary TreeLSTM, meaning that an LSTM node can be the parent of any number of children, not just two as in the above tutorial.
This is my attempt at folding the tree; it is a modified version of logits_and_state() from the above example:
def logits_and_state():
"""Creates a block that goes from tokens to (logits, state) tuples."""
word2vec = (td.GetItem(0) >> td.InputTransform(lookup_word) >>
td.Scalar('int32') >> word_embedding)
children_num =
children2vec_list = list()
children2vec_list.append(embed_subtree())
for i in range(children_num):
children2vec_list.append(embed_subtree())
children2vec = tuple(children2vec_list)
# Trees are binary, so the tree layer takes two states as its input_state.
zero_state = td.Zeros((tree_lstm.state_size,) * 2)
# Input is a word vector.
zero_inp = td.Zeros(word_embedding.output_type.shape[0])
# word_case =
word_case = td.AllOf(word2vec, zero_state)
children_case = td.AllOf(zero_inp, children2vec)
tree2vec = td.OneOf(lambda x: 1 if len(x) == 1 else 2, [(1, word_case), (2, children_case)])
return tree2vec >> tree_lstm >> (output_layer, td.Identity())
children_num is the thing I'm struggling with at the moment: I have no idea how to get that number out, even though I know that the children can be obtained with td.GetItem(1), which produces a block containing an array of children. How do I get the actual length out of that block?
You may say that I should try PyTorch or some other DL framework that also provides dynamic computation graphs, but in my case the requirement is strictly TensorFlow Fold.

How to input test data using the DecisionTree module in python?

On the Python DecisionTree module homepage (DecisionTree-1.6.1), they give a piece of example code. Here it is:
dt = DecisionTree( training_datafile = "training.dat", debug1 = 1 )
dt.get_training_data()
dt.show_training_data()
root_node = dt.construct_decision_tree_classifier()
root_node.display_decision_tree(" ")
test_sample = ['exercising=>never', 'smoking=>heavy',
'fatIntake=>heavy', 'videoAddiction=>heavy']
classification = dt.classify(root_node, test_sample)
print "Classification: ", classification
My question is: How can I specify sample data (test_sample here) from variables? On the project homepage, it says: "You classify new data by first constructing a new data vector:" I have searched around but have been unable to find out what a data vector is or the answer to my question.
Any help would be appreciated!
Um, the example says it all. It's a list of strings, with features and values separated by '=>'. In the example, one feature is 'exercising' and its value is 'never'.
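So you can assemble that list from ordinary variables; a minimal sketch (reusing the feature names, dt and root_node from the example above, with illustrative values):
# Build the test vector from variables.
exercising = 'never'
smoking = 'heavy'
fat_intake = 'heavy'
video_addiction = 'heavy'
test_sample = ['exercising=>%s' % exercising,
               'smoking=>%s' % smoking,
               'fatIntake=>%s' % fat_intake,
               'videoAddiction=>%s' % video_addiction]
classification = dt.classify(root_node, test_sample)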
