Multiple answer spans in context, BERT question answering - python

I am writing a question answering system using pre-trained BERT with a linear layer and a softmax layer on top. When following the templates available on the net, the labels of one example usually consist of only one answer_start_index and one answer_end_index. For example, from Hugging Face, when instantiating a SquadFeatures object:
```
self.input_ids = input_ids
self.attention_mask = attention_mask
self.token_type_ids = token_type_ids
self.cls_index = cls_index
self.p_mask = p_mask
self.example_index = example_index
self.unique_id = unique_id
self.paragraph_len = paragraph_len
self.token_is_max_context = token_is_max_context
self.tokens = tokens
self.token_to_orig_map = token_to_orig_map
self.start_position = start_position
self.end_position = end_position
self.is_impossible = is_impossible
self.qas_id = qas_id
```
However, in my own dataset I have examples where the answer word is found at several locations in the context, i.e. there may be several correct spans constituting the answer.
My problem is that I don't know how to handle such examples. In the templates available on the net, labels are usually in a list, say:
[start_example1, start_example2, start_example3]
[end_example1, end_example2, end_example3]
In my case this may look like:
[start_example1, [start_example2_1, start_example2_2], start_example3]
and the same for the end indices, of course.
In other words, I do not have a list containing one label per example, but a list whose elements are either a single label or a list of labels for an example, i.e. a list of lists.
When following other templates, the next step in the process is:
```
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
token_type_ids = torch.cat(token_type_ids, dim=0)
span_starts = torch.tensor(span_starts)  # Something like this
span_ends = torch.tensor(span_ends)  # Something like this
```
However, this of course raises an error, as my span_start and span_end lists do not always contain single items but sometimes a list within the list.
Does anyone have an idea of how I can tackle this problem? Should I only use examples where there is just one span constituting the answer present in the context?
If I work around the torch error, will backpropagation, evaluation and computation of the loss still work?
Thank You! /B
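One way to make such nested start/end labels tensor-friendly is to pad each example's span lists to a fixed maximum number of spans. This is only a minimal sketch (MAX_SPANS and IGNORE_INDEX are illustrative choices, not taken from any of the templates):
```
import torch

MAX_SPANS = 4         # illustrative upper bound on answer spans per example
IGNORE_INDEX = -100   # padding value, to be masked out in the loss

def pad_spans(starts, ends, max_spans=MAX_SPANS):
    """Turn one example's start/end label(s) into fixed-size 1-D tensors."""
    starts = starts if isinstance(starts, list) else [starts]
    ends = ends if isinstance(ends, list) else [ends]
    starts = (starts + [IGNORE_INDEX] * max_spans)[:max_spans]
    ends = (ends + [IGNORE_INDEX] * max_spans)[:max_spans]
    return torch.tensor(starts), torch.tensor(ends)

# One single-span example and one two-span example:
span_starts, span_ends = zip(*[pad_spans(12, 14), pad_spans([3, 40], [5, 42])])
span_starts = torch.stack(span_starts)  # shape: (batch_size, MAX_SPANS)
span_ends = torch.stack(span_ends)      # shape: (batch_size, MAX_SPANS)
```
The usual single-span cross-entropy then no longer applies directly: one option used in multi-span QA setups is to compute the loss per gold span, ignore the IGNORE_INDEX padding, and take the minimum (or the sum) over the gold spans, so backpropagation still works.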

Have you checked this code?
```
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text)
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))

all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
assert answer == "a nice puppet"
```
I am not sure if this is the best way, but instead of argmax you could use topk and check whether one of the top candidates corresponds to the correct answer.
```
t = torch.LongTensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
_, indices = t.topk(4)
indices  # tensor([9, 8, 7, 6])
```
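Building on the topk idea, here is a minimal sketch (the function and its thresholds are illustrative, not from the original answer) of extracting the top-k candidate spans from the start/end scores instead of a single argmax span:
```
import torch

def top_k_spans(start_scores, end_scores, all_tokens, k=3, max_answer_len=15):
    """Return up to k (score, answer_text) candidates from start/end scores."""
    start_top = torch.topk(start_scores.squeeze(0), k)
    end_top = torch.topk(end_scores.squeeze(0), k)
    candidates = []
    for s_score, s_idx in zip(start_top.values, start_top.indices):
        for e_score, e_idx in zip(end_top.values, end_top.indices):
            # Keep only well-formed, reasonably short spans.
            if s_idx <= e_idx < s_idx + max_answer_len:
                text = ' '.join(all_tokens[s_idx:e_idx + 1])
                candidates.append((float(s_score + e_score), text))
    return sorted(candidates, reverse=True)[:k]
```
You could then check whether any of the candidates matches one of the gold spans, which sidesteps the single-span assumption at evaluation time.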

Related

Type errors with BERT example

I'm new to the BERT QA model and was trying to follow the example found in this article. The problem is that when I run the code attached to the example, it produces the following error: TypeError: argmax(): argument 'input' (position 1) must be Tensor, not str.
Here is the code that I've tried running:
```
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

# Model
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
# Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

question = '''SAMPLE QUESTION'''
paragraph = '''SAMPLE PARAGRAPH'''

encoding = tokenizer.encode_plus(text=question, text_pair=paragraph, add_special_tokens=True)
inputs = encoding['input_ids']  # Token embeddings
sentence_embedding = encoding['token_type_ids']  # Segment embeddings
tokens = tokenizer.convert_ids_to_tokens(inputs)  # Input tokens

start_scores, end_scores = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
answer = ' '.join(tokens[start_index:end_index+1])
```
The issue appears at the torch.argmax(start_scores) call, where I'm trying to get the maximum element of start_scores: the error says this is not a tensor. When I tried printing the variable, it showed the string "start_logits". Does anyone know a solution to this issue?
After referring to the BERT documentation, we identified that the model returns an output object with multiple properties, not only the start and end scores. Thus, we applied the following changes to the code:
```
outputs = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)
answer = ' '.join(tokens[start_index:end_index+1])
```
Always refer to the documentation first :"D
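For reference, assuming a transformers version recent enough to support the return_dict argument, you can also ask the model to return plain tuples, which makes the original unpacking work again:
```
start_scores, end_scores = model(input_ids=torch.tensor([inputs]),
                                 token_type_ids=torch.tensor([sentence_embedding]),
                                 return_dict=False)
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
answer = ' '.join(tokens[start_index:end_index+1])
```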

Understanding introductory example on transformers in Trax

My goal is to understand the introductory example on transformers in Trax, which can be found at https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html:
```
import trax

# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
model = trax.models.Transformer(
    input_vocab_size=33300,
    d_model=512, d_ff=2048,
    n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
    max_len=2048, mode='predict')

# Initialize using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

# Tokenize a sentence.
sentence = 'It is nice to learn new things today!'
tokenized = list(trax.data.tokenize(iter([sentence]),  # Operates on streams.
                                    vocab_dir='gs://trax-ml/vocabs/',
                                    vocab_file='ende_32k.subword'))[0]

# Decode from the Transformer.
tokenized = tokenized[None, :]  # Add batch dimension.
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0)  # Higher temperature: more diverse results.

# De-tokenize.
tokenized_translation = tokenized_translation[0][:-1]  # Remove batch and EOS.
translation = trax.data.detokenize(tokenized_translation,
                                   vocab_dir='gs://trax-ml/vocabs/',
                                   vocab_file='ende_32k.subword')
print(translation)
```
The example works fine. However, when I try to translate another example with the initialised model, e.g.
```
sentence = 'I would like to try another example.'
tokenized = list(trax.data.tokenize(iter([sentence]),
                                    vocab_dir='gs://trax-ml/vocabs/',
                                    vocab_file='ende_32k.subword'))[0]
tokenized = tokenized[None, :]
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0)
tokenized_translation = tokenized_translation[0][:-1]
translation = trax.data.detokenize(tokenized_translation,
                                   vocab_dir='gs://trax-ml/vocabs/',
                                   vocab_file='ende_32k.subword')
print(translation)
```
I get the output "!," on my local machine as well as on Google Colab. The same happens with other examples.
When I build and initialise a new model, everything works fine.
Is this a bug? If not, what is happening here and how can I avoid/fix that behaviour?
Tokenization and detokenization seem to work well; I debugged that. Things seem to go wrong in trax.supervised.decoding.autoregressive_sample.
I found it out myself: one needs to reset the model's state. In mode='predict' the model keeps an internal decoding state between calls, so the state left over from the first translation corrupts the next one. The following code works for me:
```
def translate(model, sentence, vocab_dir, vocab_file):
    empty_state = model.state  # save empty state
    tokenized_sentence = next(trax.data.tokenize(iter([sentence]), vocab_dir=vocab_dir,
                                                 vocab_file=vocab_file))
    tokenized_translation = trax.supervised.decoding.autoregressive_sample(
        model, tokenized_sentence[None, :], temperature=0.0)[0][:-1]
    translation = trax.data.detokenize(tokenized_translation, vocab_dir=vocab_dir,
                                       vocab_file=vocab_file)
    model.state = empty_state  # reset state
    return translation

# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
model = trax.models.Transformer(input_vocab_size=33300, d_model=512, d_ff=2048, n_heads=8,
                                n_encoder_layers=6, n_decoder_layers=6, max_len=2048,
                                mode='predict')

# Initialize using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

print(translate(model, 'It is nice to learn new things today!',
                vocab_dir='gs://trax-ml/vocabs/', vocab_file='ende_32k.subword'))
print(translate(model, 'I would like to try another example.',
                vocab_dir='gs://trax-ml/vocabs/', vocab_file='ende_32k.subword'))
```

Create a model to classify whether a sentence is logical or not

As the title mentions, how could I train a model to classify whether sentences like the following are logical or illogical?
“He has two legs”–logical
“He has six legs”–illogical
Solutions I tried:
1: Train a CNN classifier
I have done this before; it works very well if you have enough data. The problem is that I do not have a large data set with “logical”/“illogical” labels for this case.
2: Use a language model
Train a language model provided by gluonnlp on a data set like wiki, and use it to estimate the probability of the sentences. If the probability of a sentence is high, mark it as logical, and vice versa. The problem is that the results are not good.
The way I estimate the probability:
```
def __predict(self):
    lines = self.__text_edit_input.toPlainText().split("\n")
    result = ""
    for line in lines:
        result += str(self.__sentence_prob(line, 10)) + "\n"
    self.__text_edit_output.setPlainText(result)

def __prepare_sentence(self, text, max_len):
    result = mx.nd.zeros([max_len, 1], dtype='float32')
    max_len = min(len(text), max_len)
    i = max(max_len - len(text), 0)
    j = 0
    for index in range(i, max_len):
        result[index][0] = self.__vocab[text[j]]
        j = j + 1
    return result

def __sentence_prob(self, text, max_len):
    hiddens = self.__model.begin_state(1, func=mx.nd.zeros, ctx=self.__context)
    tokens = self.__tokenizer(text)
    data = self.__prepare_sentence(tokens, max_len)
    output, _ = self.__model(data, hiddens)
    prob = 0
    for i in range(max_len):
        total_prob = mx.nd.softmax(output[i][0])
        prob += total_prob[self.__vocab[i]].asscalar()
    return prob / max_len
```
Possible issues with the language model approach:
1. Not splitting the sentences the correct way (I am using jieba to split the Chinese sentences)
2. The vocabulary size is too small/large (tested 10000, 15000 and 30000)
3. The loss is too high (perplexity around 190) after 50 epochs?
4. The sentence length should be larger/smaller (tried 10, 20, 35)
5. The data I use does not meet my requirements (not every sentence is logical)
6. A language model is not appropriate for this task?
Any suggestions? Thanks.
Issue 6 ("A language model is not appropriate for this task?") is the main problem. Language models are built to make sense of input text with respect to language usage (syntax, semantics, etc.), not to draw logical conclusions. So you may not get good results even with a large amount of data or very deep models.
The problem you're trying to solve is extremely difficult. Something you may want to look at is Symbolic AI. There's a lot of ongoing research in this area.

How can I filter tf.data.Dataset by specific values?

I create a dataset by reading TFRecords, I map the values, and I want to filter the dataset for specific values, but since the result is a dict with tensors, I am not able to get the actual value of a tensor or to check it with tf.cond() / tf.equal(). How can I do that?
```
def mapping_func(serialized_example):
    feature = {'label': tf.FixedLenFeature([1], tf.string)}
    features = tf.parse_single_example(serialized_example, features=feature)
    return features

def filter_func(features):
    # this doesn't work
    #result = features['label'] == 'some_label_value'
    # neither this
    result = tf.reshape(tf.equal(features['label'], 'some_label_value'), [])
    return result

def main():
    file_names = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
    dataset = tf.contrib.data.TFRecordDataset(file_names)
    dataset = dataset.map(mapping_func)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.filter(filter_func)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    sample = iterator.get_next()
```
I am answering my own question. I found the issue!
What I needed to do was tf.unstack() the label, like this:
```
label = tf.unstack(features['label'])
label = label[0]
```
before I give it to tf.equal():
```
result = tf.reshape(tf.equal(label, 'some_label_value'), [])
```
I suppose the problem was that the label is defined as an array with one element of type string, tf.FixedLenFeature([1], tf.string), so in order to get that single element I had to unstack it (which creates a list) and then take the element at index 0. Correct me if I'm wrong.
I think you don't need to make the label a 1-dimensional array in the first place. With
```
feature = {'label': tf.FixedLenFeature((), tf.string)}
```
you won't need to unstack the label in your filter_func.
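Putting that suggestion together, here is a minimal sketch (TF 1.x style, matching the question's code; the label value is a placeholder) of the map and filter functions with a scalar label feature:
```
import tensorflow as tf

def mapping_func(serialized_example):
    # Scalar string feature: no [1]-shaped array, so no unstacking needed later.
    feature = {'label': tf.FixedLenFeature((), tf.string)}
    return tf.parse_single_example(serialized_example, features=feature)

def filter_func(features):
    # tf.equal on a scalar already yields the scalar boolean that filter() expects.
    return tf.equal(features['label'], 'some_label_value')
```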
Reading and filtering a dataset is very easy and there is no need to unstack anything.
To read the dataset:
```
print(my_dataset, '\n\n')
##let us print the first 3 records
for record in my_dataset.take(3):
    ##below could be large in case of image
    print(record)
    ##let us print a specific key
    print(record['key2'])
```
To filter is equally simple:
```
my_filtereddataset = my_dataset.filter(_filtcond1)
```
where you define _filtcond1 however you want. Let us say there is a true/false boolean flag in your dataset; then:
```
@tf.function
def _filtcond1(x):
    return x['key_bool'] == 1
```
or even a lambda function:
```
my_filtereddataset = my_dataset.filter(lambda x: x['key_int'] > 13)
```
If you are reading a dataset which you haven't created, or you are unaware of the keys (as seems to be the OP's case), you can use this to get an idea of the keys and structure first:
```
import json
from google.protobuf.json_format import MessageToJson

for raw_record in noidea_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    ##print(example)  ##if image it will be toooolong
    m = json.loads(MessageToJson(example))
    print(m['features']['feature'].keys())
```
Now you can proceed with the filtering.
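For completeness, here is a rough end-to-end sketch in the same eager/TF2 style as this answer, applied to the original pipeline (file names and the label value are placeholders from the question):
```
import tensorflow as tf

def mapping_func(serialized_example):
    feature = {'label': tf.io.FixedLenFeature((), tf.string)}
    return tf.io.parse_single_example(serialized_example, feature)

dataset = (tf.data.TFRecordDataset(["/var/data/file1.tfrecord",
                                    "/var/data/file2.tfrecord"])
           .map(mapping_func)
           .shuffle(buffer_size=10000)
           .filter(lambda x: x['label'] == 'some_label_value')
           .repeat())

for record in dataset.take(3):
    print(record['label'])
```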
You should try the apply function from tf.data.TFRecordDataset (see the tensorflow documentation); a rough sketch follows below.
Otherwise, read this article to get a better understanding of TFRecords: TFRecords for humans.
But the most likely situation is that you can neither access nor modify a TFRecord; there is a request on GitHub about this topic: TFRecords request.
My advice is to make things as easy as you can; keep in mind that you are working with graphs and sessions.
In any case, if everything fails, try running the part of the code that does not work inside a TensorFlow session, as simply as you can; all these operations probably need to be done while a tf.Session is running.
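As a rough sketch of the apply suggestion (assuming the label was parsed as a scalar, as in the FixedLenFeature((), tf.string) answer above; the helper name is illustrative):
```
def keep_label(label_value):
    # A reusable transformation: Dataset in, filtered Dataset out.
    def _transform(ds):
        return ds.filter(lambda feats: tf.equal(feats['label'], label_value))
    return _transform

dataset = dataset.apply(keep_label('some_label_value'))
```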

How to import word2vec into TensorFlow Seq2Seq model?

I am playing with the TensorFlow sequence-to-sequence translation model. I was wondering if I could import my own word2vec embeddings into this model, rather than using the 'dense representation' mentioned in the tutorial.
From my point of view, it looks like TensorFlow is using a one-hot representation for the seq2seq model. Firstly, the function tf.nn.seq2seq.embedding_attention_seq2seq takes tokenized symbols as the encoder's input, e.g. 'a' would be '4' and 'dog' would be '15715', etc., and it requires num_encoder_symbols. So I think it makes me provide the position of the word and the total number of words, and then the function can represent the word as a one-hot vector. I am still learning the source code, but it is hard to understand.
Could anyone give me an idea on the above problem?
The seq2seq embedding_* functions indeed create embedding matrices very similar to those from word2vec. They are stored in a variable named something like this:
```
EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
```
Knowing this, you can just modify this variable. I mean -- get your word2vec vectors in some format, say a text file. Assuming you have your vocabulary in model.vocab, you can then assign the read vectors as illustrated by the snippet below (it's just a snippet; you'll have to change it to make it work, but I hope it shows the idea).
```
vectors_variable = [v for v in tf.trainable_variables()
                    if EMBEDDING_KEY in v.name]
if len(vectors_variable) != 1:
    print("Word vector variable not found or too many.")
    sys.exit(1)
vectors_variable = vectors_variable[0]
vectors = vectors_variable.eval()

print("Setting word vectors from %s" % FLAGS.word_vector_file)
with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
    # Lines have format: dog 0.045123 -0.61323 0.413667 ...
    for line in f:
        line_parts = line.split()
        # The first part is the word.
        word = line_parts[0]
        if word in model.vocab:
            # Remaining parts are components of the vector.
            word_vector = np.array([float(x) for x in line_parts[1:]])
            if len(word_vector) != vec_size:
                print("Warn: Word '%s', Expecting vector size %d, found %d"
                      % (word, vec_size, len(word_vector)))
            else:
                vectors[model.vocab[word]] = word_vector

# Assign the modified vectors to the vectors_variable in the graph.
session.run([vectors_variable.initializer],
            {vectors_variable.initializer.inputs[1]: vectors})
```
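If the initializer-feeding trick above feels opaque, a simpler alternative in TF 1.x (a sketch, not from the original answer) is to assign the matrix directly:
```
# Either run an explicit assign op...
session.run(vectors_variable.assign(vectors))
# ...or use the Variable.load helper, which does the same thing.
vectors_variable.load(vectors, session)
```
Note that assign() adds an op to the graph, which is fine for a one-off setup step.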
I guess with the scope style, which Matthew mentioned, you can get the variable like this:
```
with tf.variable_scope("embedding_attention_seq2seq"):
    with tf.variable_scope("RNN"):
        with tf.variable_scope("EmbeddingWrapper", reuse=True):
            # Fill in the actual shape (and trainable flag) for your model.
            embedding = vs.get_variable("embedding", [vocab_size, embedding_size])
```
Also, I would imagine you would want to inject embeddings into the decoder as well; the key (or scope) for it would be something like:
```
"embedding_attention_seq2seq/embedding_attention_decoder/embedding"
```
Thanks for your answer, Lukasz!
I was wondering, what exactly does model.vocab[word] stand for in the code snippet? Just the position of the word in the vocabulary?
In that case, wouldn't it be faster to iterate through the vocabulary and inject the word2vec vectors only for the words that exist in the word2vec model?
