I am trying to understand how to use BERT for QnA and found a tutorial on how to get started with PyTorch (here). Now I would like to use these snippets to get started, but I do not understand how to map the output back onto the example text.
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
(...)
# Predict the start and end positions logits
with torch.no_grad():
    start_logits, end_logits = questionAnswering_model(tokens_tensor, segments_tensors)

# Or get the total loss
start_positions, end_positions = torch.tensor([12]), torch.tensor([14])
multiple_choice_loss = questionAnswering_model(
    tokens_tensor,
    segments_tensors,
    start_positions=start_positions,
    end_positions=end_positions)
start_logits (shape: [1, 16]): tensor([[ 0.0196, 0.1578, 0.0848, 0.1333, -0.4113, -0.0241, -0.1060, -0.3649,
                                         0.0955, -0.4644, -0.1548, 0.0967, -0.0659, 0.1055, -0.1488, -0.3649]])
end_logits (shape: [1, 16]): tensor([[ 0.1828, -0.2691, -0.0594, -0.1618, 0.0441, -0.2574, -0.2883, 0.2526,
                                       -0.0551, -0.0051, -0.1572, -0.1670, -0.1219, -0.1831, -0.4463, 0.2526]])
If my assumption is correct, start_logits and end_logits need to be mapped back onto the text, but how do I compute this?
Additionally, do you have any resources/guides/tutorials you could recommend for going deeper into QnA (other than the google-research/bert GitHub repo and the BERT paper)?
Thank you in advance.
I think you are trying to use BERT for Q&A where the answer is a span of the original text. The original paper uses it for the SQuAD dataset, where this is the case. start_logits and end_logits hold the logits for each token being the start/end of the answer, so you can take the argmax of each and that gives you the index of the corresponding token in the text.
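A minimal sketch of that projection, assuming `tokenized_text` is the token list that was converted into `tokens_tensor` (so its length matches the 16 logits):
```
import torch

start_index = torch.argmax(start_logits).item()  # most likely start token
end_index = torch.argmax(end_logits).item()      # most likely end token
answer = ' '.join(tokenized_text[start_index:end_index + 1])
print(answer)  # with the gold positions 12 and 14 above, this span would be 'a puppet ##eer'
```
To get a clean answer string you would still merge the '##' WordPiece continuations back into whole words.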
There is this NAACL tutorial on transfer learning from the author of the repo you linked: https://colab.research.google.com/drive/1iDHCYIrWswIKp-n-pOg69xLoZO09MEgf#scrollTo=qQ7-pH1Jp5EG It uses classification as the target task, but you might still find it useful.
My goal is to understand the introductory example on transformers in Trax, which can be found at https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html:
import trax

# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
model = trax.models.Transformer(
    input_vocab_size=33300,
    d_model=512, d_ff=2048,
    n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
    max_len=2048, mode='predict')

# Initialize using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

# Tokenize a sentence.
sentence = 'It is nice to learn new things today!'
tokenized = list(trax.data.tokenize(iter([sentence]),  # Operates on streams.
                                    vocab_dir='gs://trax-ml/vocabs/',
                                    vocab_file='ende_32k.subword'))[0]

# Decode from the Transformer.
tokenized = tokenized[None, :]  # Add batch dimension.
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0)  # Higher temperature: more diverse results.

# De-tokenize.
tokenized_translation = tokenized_translation[0][:-1]  # Remove batch and EOS.
translation = trax.data.detokenize(tokenized_translation,
                                   vocab_dir='gs://trax-ml/vocabs/',
                                   vocab_file='ende_32k.subword')
print(translation)
The example works fine. However, when I try to translate another example with the initialised model, e.g.
sentence = 'I would like to try another example.'
tokenized = list(trax.data.tokenize(iter([sentence]),
                                    vocab_dir='gs://trax-ml/vocabs/',
                                    vocab_file='ende_32k.subword'))[0]
tokenized = tokenized[None, :]
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0)
tokenized_translation = tokenized_translation[0][:-1]
translation = trax.data.detokenize(tokenized_translation,
                                   vocab_dir='gs://trax-ml/vocabs/',
                                   vocab_file='ende_32k.subword')
print(translation)
I get the output "!,", both on my local machine and on Google Colab. The same happens with other examples.
When I build and initialise a new model, everything works fine.
Is this a bug? If not, what is happening here and how can I avoid/fix that behaviour?
Tokenization and detokenization seem to work well; I debugged that. Things seem to go wrong in trax.supervised.decoding.autoregressive_sample.
I found it out myself: one needs to reset the model's state. The following code works for me:
def translate(model, sentence, vocab_dir, vocab_file):
    empty_state = model.state  # save empty state
    tokenized_sentence = next(trax.data.tokenize(iter([sentence]), vocab_dir=vocab_dir,
                                                 vocab_file=vocab_file))
    tokenized_translation = trax.supervised.decoding.autoregressive_sample(
        model, tokenized_sentence[None, :], temperature=0.0)[0][:-1]
    translation = trax.data.detokenize(tokenized_translation, vocab_dir=vocab_dir,
                                       vocab_file=vocab_file)
    model.state = empty_state  # reset state
    return translation


# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
model = trax.models.Transformer(input_vocab_size=33300, d_model=512, d_ff=2048, n_heads=8,
                                n_encoder_layers=6, n_decoder_layers=6, max_len=2048,
                                mode='predict')

# Initialize using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

print(translate(model, 'It is nice to learn new things today!',
                vocab_dir='gs://trax-ml/vocabs/', vocab_file='ende_32k.subword'))
print(translate(model, 'I would like to try another example.',
                vocab_dir='gs://trax-ml/vocabs/', vocab_file='ende_32k.subword'))
I am writing a Question Answering system using pre-trained BERT with a linear layer and a softmax layer on top. When following the templates available on the net, the labels of one example usually consist of only one answer_start_index and one answer_end_index. For example, from Huggingface when instantiating a SquadFeatures object:
```
self.input_ids = input_ids
self.attention_mask = attention_mask
self.token_type_ids = token_type_ids
self.cls_index = cls_index
self.p_mask = p_mask
self.example_index = example_index
self.unique_id = unique_id
self.paragraph_len = paragraph_len
self.token_is_max_context = token_is_max_context
self.tokens = tokens
self.token_to_orig_map = token_to_orig_map
self.start_position = start_position
self.end_position = end_position
self.is_impossible = is_impossible
self.qas_id = qas_id
```
However, in my own dataset I have examples where the answer word is found at several locations in the context, i.e. there may be several correct spans constituting the answer.
My problem is that I don't know how to handle such examples. In the templates available on the net, the labels are usually in a list, say:
[start_example1, start_example2, start_example3]
[end_example1, end_example2, end_example3]
In my case this may look like:
[start_example1, [start_example2_1, start_example2_2], start_example3]
and the same for the end indices, of course.
In other words, I do not have a list containing one label per example, but a list containing either a single label or a list of "labels" per example, i.e. a list of lists.
When following other templates the next step in the process is:
```
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
token_type_ids = torch.cat(token_type_ids, dim=0)
span_starts = torch.tensor(span_starts)  # Something like this
span_ends = torch.tensor(span_ends)      # Something like this
```
However, this of course (?) raises an error, as my span_start and span_end lists do not contain only single items but sometimes a list within the list, as the small illustration below shows.
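A minimal illustration of the ragged structure (the values are made up, just to show the shape problem):
```
import torch

span_starts = [12, [7, 19], 3]   # example 2 has two possible start positions
try:
    torch.tensor(span_starts)
except (TypeError, ValueError) as e:
    print('cannot build a rectangular tensor:', e)
```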
Does anyone have an idea of how I can tackle this problem? Should I only use examples where there is a single span constituting the answer present in the context?
If I work around the torch error, will backpropagation / evaluation / computation of the loss still work?
Thank You! /B
Have you checked this code?
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text)
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
assert answer == "a nice puppet"
I am not sure if this is the best way, but instead of argmax you could use topk and check whether one of the top-k candidates corresponds to the correct answer.
t = torch.LongTensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
_, indices = t.topk(4)
indices  # tensor([9, 8, 7, 6])
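A hedged sketch of how that could look with the start_scores/end_scores from the snippet above (the value of k and the span filtering are illustrative choices, not a recommendation):
```
k = 3
_, start_candidates = torch.topk(start_scores[0], k)
_, end_candidates = torch.topk(end_scores[0], k)

for s in start_candidates.tolist():
    for e in end_candidates.tolist():
        if s <= e:  # keep only well-formed spans
            print(' '.join(all_tokens[s:e + 1]))
```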
I have a dataset of 6000 observations; a sample of it is the following:
job_id job_title job_sector
30018141 Secondary Teaching Assistant Education
30006499 Legal Sales Assistant / Executive Sales
28661197 Private Client Practitioner Legal
28585608 Senior hydropower mechanical project manager Engineering
28583146 Warehouse Stock Checker - Temp / Immediate Start Transport & Logistics
28542478 Security Architect Contract IT & Telecoms
The goal is to predict the job sector of each row based on the job title.
Firstly, I apply some preprocessing on the job_title column:
def preprocess(document):
    lemmatizer = WordNetLemmatizer()
    stemmer_1 = PorterStemmer()
    stemmer_2 = LancasterStemmer()
    stemmer_3 = SnowballStemmer(language='english')

    # Remove all the special characters
    document = re.sub(r'\W', ' ', document)

    # Remove all single characters
    document = re.sub(r'\b[a-zA-Z]\b', ' ', document)

    # Substitute multiple spaces with a single space
    document = re.sub(r' +', ' ', document, flags=re.I)

    # Convert to lowercase
    document = document.lower()

    # Tokenisation
    document = document.split()

    # Stemming
    document = [stemmer_3.stem(word) for word in document]
    document = ' '.join(document)

    return document


df_first = pd.read_csv('../data.csv', keep_default_na=True)

for index, row in df_first.iterrows():
    df_first.loc[index, 'job_title'] = preprocess(row['job_title'])
Then I do the following with Gensim and Doc2Vec:
X = df_first.loc[:, 'job_title'].values
y = df_first.loc[:, 'job_sector'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

tagged_train = TaggedDocument(words=X_train.tolist(), tags=y_train.tolist())
tagged_train = list(tagged_train)

tagged_test = TaggedDocument(words=X_test.tolist(), tags=y_test.tolist())
tagged_test = list(tagged_test)

model = Doc2Vec(vector_size=5, min_count=2, epochs=30)

training_set = [TaggedDocument(sentence, tag) for sentence, tag in zip(X_train.tolist(), y_train.tolist())]
model.build_vocab(training_set)
model.train(training_set, total_examples=model.corpus_count, epochs=model.epochs)

test_set = [TaggedDocument(sentence, tag) for sentence, tag in zip(X_test.tolist(), y_test.tolist())]

predictors_train = []
for sentence in X_train.tolist():
    sentence = sentence.split()
    predictor = model.infer_vector(doc_words=sentence, steps=20, alpha=0.01)
    predictors_train.append(predictor.tolist())

predictors_test = []
for sentence in X_test.tolist():
    sentence = sentence.split()
    predictor = model.infer_vector(doc_words=sentence, steps=20, alpha=0.025)
    predictors_test.append(predictor.tolist())

sv_classifier = SVC(kernel='linear', class_weight='balanced', decision_function_shape='ovr', random_state=0)
sv_classifier.fit(predictors_train, y_train)

score = sv_classifier.score(predictors_test, y_test)
print('accuracy: {}%'.format(round(score * 100, 1)))
However, the result I am getting is 22% accuracy.
This makes me very suspicious, especially because using the TfidfVectorizer instead of Doc2Vec (both with the same classifier) gives me 88% accuracy (!).
Therefore, I guess that I must be doing something wrong in how I apply Gensim's Doc2Vec.
What is it and how can I fix it?
Or is it simply that my dataset is relatively small, while more advanced methods such as word embeddings require far more data?
You don't mention the size of your dataset - in rows, total words, unique words, or unique classes. Doc2Vec works best with lots of data. Most published work trains on tens-of-thousands to millions of documents, of dozens to thousands of words each. (Your data appears to only have 3-5 words per document.)
Also, published work tends to train on data where every document has a unique-ID. It can sometimes make sense to use known-labels as tags instead of, or in addition to, unique-IDs. But it isn't necessarily a better approach. By using known-labels as the only tags, you're effectively only training one doc-vector per label. (It's essentially similar to concatenating all rows with the same tag into one document.)
You're inexplicably using fewer steps in inference than epochs in training, when in fact these are analogous values. In recent versions of gensim, inference will by default use the same number of epochs as the model was configured to use for training, and it's more common to use more epochs during inference than in training. (Also, you're inexplicably using different starting alpha values when inferring the classifier-training vectors and the classifier-testing vectors.)
But the main problem is likely your choice of tiny vector_size=5 doc vectors. Instead of the TfidfVectorizer, which summarizes each row as a vector whose width equals the unique-word count (perhaps hundreds or thousands of dimensions), your Doc2Vec model summarizes each document as just 5 values. You've essentially lobotomized Doc2Vec. Usual values here are 100-1000, though if the dataset is tiny, smaller sizes may be required.
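A hedged sketch of what those suggestions might look like in code (unique IDs as tags, a larger vector_size, and default inference epochs in a recent gensim); the specific values are illustrative, not tuned:
```
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train_docs = [TaggedDocument(words=title.split(), tags=[str(i)])
              for i, title in enumerate(X_train.tolist())]

d2v = Doc2Vec(vector_size=100, min_count=2, epochs=30)
d2v.build_vocab(train_docs)
d2v.train(train_docs, total_examples=d2v.corpus_count, epochs=d2v.epochs)

# In recent gensim versions, infer_vector() defaults to the model's training epochs.
predictors_train = [d2v.infer_vector(title.split()) for title in X_train.tolist()]
predictors_test = [d2v.infer_vector(title.split()) for title in X_test.tolist()]
```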
Finally, the lemmatization/stemming may not be strictly necessary and may even be destructive. Lots of Word2Vec/Doc2Vec work doesn't bother to lemmatize/stem - often because there's plentiful data, with many appearances of all word forms.
These steps are most likely to help with smaller data, by making sure rarer word forms are combined with related longer forms to still get value from words that would otherwise be too rare to be retained (or get useful vectors).
But I can see many ways they might hurt for your domain. Manager and Management won't have exactly the same implications in this context, but could both be stemmed to manag. Similarly, Security and Securities both become secur, and so on for other words. I'd only perform these steps if you can prove through evaluation that they're helping. (Are the words passed to the TfidfVectorizer being lemmatized/stemmed?)
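A quick way to check those collisions, using the same SnowballStemmer as the question's preprocess() function:
```
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')
# Per the point above, these pairs are expected to collapse to the same stem.
print(stemmer.stem('manager'), stemmer.stem('management'))
print(stemmer.stem('security'), stemmer.stem('securities'))
```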
Training doc2vec/word2vec usually requires lots of general data (word2vec was trained on roughly 3 million Wikipedia articles). Since it's performing poorly with doc2vec, consider experimenting with a pre-trained doc2vec model (refer to this).
Or you can try using word2vec and averaging the word vectors over the entire document, since word2vec gives a vector for each word; a sketch of that is below.
Let me know how this helps.
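A hedged sketch of the averaging idea, assuming a gensim Word2Vec model named w2v_model (a hypothetical name) that has been trained or loaded beforehand:
```
import numpy as np

def average_vector(text, w2v_model):
    # Average the vectors of the in-vocabulary words; zeros if none are known.
    words = [w for w in text.split() if w in w2v_model.wv]
    if not words:
        return np.zeros(w2v_model.vector_size)
    return np.mean([w2v_model.wv[w] for w in words], axis=0)

predictors_train = [average_vector(t, w2v_model) for t in X_train.tolist()]
predictors_test = [average_vector(t, w2v_model) for t in X_test.tolist()]
```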
The tools you are using are not suitable for this classification task. I'd suggest you look into something like a char-rnn.
https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
This tutorial works on a similar problem, where it classifies names.
I'm training two GMM classifiers, one for each label, on MFCC values.
I concatenated all the MFCC values of a class and fitted them to that class's classifier.
For each classifier, I sum up the probabilities it predicts for its label.
def createGMMClassifiers():
    label_samples = {}
    for label, sample in training.iteritems():
        labelstack = np.empty((50, 13))
        for feature in sample:
            #debugger.set_trace()
            labelstack = np.concatenate((labelstack, feature))
        label_samples[label] = labelstack

    for label in label_samples:
        #debugger.set_trace()
        classifiers[label] = mixture.GMM(n_components=n_classes)
        classifiers[label].fit(label_samples[label])

    for sample in testing['happy']:
        classify(sample)

def classify(testMFCC):
    probability = {'happy': 0, 'sad': 0}
    for name, classifier in classifiers.iteritems():
        prediction = classifier.predict_proba(testMFCC)
        for probforlabel in prediction:
            probability[name] += probforlabel[0]

    print 'happy ', probability['happy'], 'sad ', probability['sad']

    if probability['happy'] > probability['sad']:
        print 'happy'
    else:
        print 'sad'
But my results do not seem to be consistent, and I find it hard to believe this is due to the random_state=None setting, since the predictions are often the same label for all test data, yet each run often gives the exact opposite (see Output 1 and Output 2).
So my question is: am I doing something obviously wrong while training my classifiers?
Output 1:
happy 123.559202732 sad 122.409167294
happy
happy 120.000879032 sad 119.883786657
happy
happy 124.000069307 sad 123.999928962
happy
happy 118.874574047 sad 118.920941127
sad
happy 117.441353421 sad 122.71924156
sad
happy 122.210579428 sad 121.997571901
happy
happy 120.981752603 sad 120.325940128
happy
happy 126.013713257 sad 125.885047394
happy
happy 122.776016525 sad 122.12320875
happy
happy 115.064172476 sad 114.999513909
happy
Output 2:
happy 123.559202732 sad 122.409167294
happy
happy 120.000879032 sad 119.883786657
happy
happy 124.000069307 sad 123.999928962
happy
happy 118.874574047 sad 118.920941127
sad
happy 117.441353421 sad 122.71924156
sad
happy 122.210579428 sad 121.997571901
happy
happy 120.981752603 sad 120.325940128
happy
happy 126.013713257 sad 125.885047394
happy
happy 122.776016525 sad 122.12320875
happy
happy 115.064172476 sad 114.999513909
happy
Earlier I asked a relevant question and got a correct answer. I'm providing the link below.
Having different results every run with GMM Classifier
Edit:
Added the main function, which collects the data and splits it into training and testing sets.
def main():
    happyDir = dir + 'happy/'
    sadDir = dir + 'sad/'

    training["sad"] = []
    training["happy"] = []
    testing["happy"] = []

    #TestSet
    for wavFile in os.listdir(happyDir)[::-1][:10]:
        #print wavFile
        fullPath = happyDir + wavFile
        testing["happy"].append(sf.getFeatures(fullPath))

    #TrainSet
    for wavFile in os.listdir(happyDir)[::-1][10:]:
        #print wavFile
        fullPath = happyDir + wavFile
        training["happy"].append(sf.getFeatures(fullPath))

    for wavFile in os.listdir(sadDir)[::-1][10:]:
        fullPath = sadDir + wavFile
        training["sad"].append(sf.getFeatures(fullPath))

    #Ensure the number of files in set
    print "Test(Happy): ", len(testing['happy'])
    print "Train(Happy): ", len(training['happy'])

    createGMMClassifiers()
Edit 2:
Changed the code according to the answer. I'm still getting similarly inconsistent results.
For classification tasks it's important to tune the parameters given to the classifier. Many classification algorithms are very sensitive to their parameters, which means that even a small change to a model parameter can produce hugely different results. It's also important to try different algorithms rather than using one algorithm for every classification task.
For this problem, you can try different classification algorithms to check that your data is good, and try different parameter values for each classifier; then you can determine where the problem is.
One alternative is to use grid search to explore and tune the best parameters for a specific classifier; read this: http://scikit-learn.org/stable/modules/grid_search.html
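A hedged sketch of the grid-search idea applied to the mixture model used in the question (with the modern sklearn class name GaussianMixture); the parameter values and the happy_mfcc_frames name are illustrative only:
```
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_components': [2, 4, 8, 16],
    'covariance_type': ['full', 'diag'],
}
# Without an explicit scorer, GridSearchCV uses GaussianMixture.score,
# i.e. the average held-out log-likelihood per sample.
search = GridSearchCV(GaussianMixture(), param_grid, cv=3)
search.fit(happy_mfcc_frames)  # e.g. the stacked MFCC frames for one label
print(search.best_params_)
```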
Your code does not make much sense: you recreate the classifier for every new training sample.
The correct code scheme for training should look roughly like this:
label_samples = {}
classifiers = {}

# First we collect all samples per label into an array of samples
for label, sample in samples:
    label_samples[label].concatenate(sample)

# Then we train a classifier on every label's data
for label in label_samples:
    classifiers[label] = mixture.GMM(n_components=n_classes)
    classifiers[label].fit(label_samples[label])
Your decoding code is ok.
I am playing with the TensorFlow sequence-to-sequence translation model. I was wondering if I could import my own word2vec embeddings into this model, rather than using the original 'dense representation' mentioned in the tutorial.
From my point of view, it looks like TensorFlow is using a one-hot representation for the seq2seq model. Firstly, for the function tf.nn.seq2seq.embedding_attention_seq2seq, the encoder's input is a tokenized symbol, e.g. 'a' would be '4' and 'dog' would be '15715', etc., and the function requires num_encoder_symbols. So I think it makes me provide the position of the word and the total number of words, and then the function can represent the word in a one-hot representation. I am still studying the source code, but it is hard to understand.
Could anyone give me an idea about the above problem?
The seq2seq embedding_* functions indeed create embedding matrices very similar to those from word2vec. They are stored in a variable named something like this:
EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
Knowing this, you can just modify this variable. I mean: get your word2vec vectors in some format, say a text file. Assuming you have your vocabulary in model.vocab, you can then assign the read vectors in the way illustrated by the snippet below (it's just a snippet, you'll have to change it to make it work, but I hope it shows the idea).
vectors_variable = [v for v in tf.trainable_variables()
                    if EMBEDDING_KEY in v.name]
if len(vectors_variable) != 1:
    print("Word vector variable not found or too many.")
    sys.exit(1)
vectors_variable = vectors_variable[0]
vectors = vectors_variable.eval()
print("Setting word vectors from %s" % FLAGS.word_vector_file)
with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
    # Lines have format: dog 0.045123 -0.61323 0.413667 ...
    for line in f:
        line_parts = line.split()
        # The first part is the word.
        word = line_parts[0]
        if word in model.vocab:
            # Remaining parts are components of the vector.
            word_vector = np.array(map(float, line_parts[1:]))
            if len(word_vector) != vec_size:
                print("Warn: Word '%s', Expecting vector size %d, found %d"
                      % (word, vec_size, len(word_vector)))
            else:
                vectors[model.vocab[word]] = word_vector
# Assign the modified vectors to the vectors_variable in the graph.
session.run([vectors_variable.initializer],
            {vectors_variable.initializer.inputs[1]: vectors})
I guess with the scope style, which Matthew mentioned, you can get the variable like this:
with tf.variable_scope("embedding_attention_seq2seq"):
    with tf.variable_scope("RNN"):
        with tf.variable_scope("EmbeddingWrapper", reuse=True):
            embedding = vs.get_variable("embedding", [shape], [trainable=])
Also, I would imagine you would want to inject embeddings into the decoder as well; the key (or scope) for it would be something like:
"embedding_attention_seq2seq/embedding_attention_decoder/embedding"
Thanks for your answer, Lukasz!
I was wondering, what exactly does model.vocab[word] stand for in the code snippet? Just the position of the word in the vocabulary?
In that case, wouldn't it be faster to iterate through the vocabulary and inject the w2v vectors only for the words that exist in the w2v model?