When I run the following (with a long test_text and a short question):
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Encode question and passage as one sequence: [CLS] question [SEP] test_text [SEP]
input_ids = tokenizer.encode(question, test_text)
print('Query has {:,} tokens.\n'.format(len(input_ids)))

# Segment A covers the question up to and including the first [SEP];
# segment B covers the rest of the sequence
sep_index = input_ids.index(tokenizer.sep_token_id)
num_seg_a = sep_index + 1
num_seg_b = len(input_ids) - num_seg_a
segment_ids = [0]*num_seg_a + [1]*num_seg_b

start_scores, end_scores = model(torch.tensor([input_ids]),
                                 token_type_ids=torch.tensor([segment_ids]))
I get an error with the following output:
Token indices sequence length is longer than the specified maximum sequence length for this model (1244 > 512). Running this sequence through the model will result in indexing errors
Query has 1,244 tokens.
How can I split test_text into chunks of maximal length, knowing that the question plus a chunk must not exceed 512 tokens? I would then ask the same question of each chunk and take the best answer out of all of them, also going through the text a second time with different slice points, in case the answer is cut in half at a chunk boundary.
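One way to do this (a sketch I am adding, not from the original post: the 300-word chunk size, the half-chunk offset for the second pass, and the answer_with_score helper are my assumptions, and I assume the same transformers version as above, where the model returns a (start_scores, end_scores) tuple) is to slide a window over the words of test_text and keep the answer whose combined start/end logits score highest:

```
import torch

def answer_with_score(question, chunk_text):
    # Encode question + chunk; the tokenizer builds the segment ids for us
    encoding = tokenizer.encode_plus(question, chunk_text)
    input_ids = encoding['input_ids']
    start_scores, end_scores = model(torch.tensor([input_ids]),
                                     token_type_ids=torch.tensor([encoding['token_type_ids']]))
    start, end = torch.argmax(start_scores), torch.argmax(end_scores)
    score = (start_scores[0, start] + end_scores[0, end]).item()
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    return score, ' '.join(tokens[start:end + 1])

# Chunk the passage by words so that question + chunk stays under 512 tokens;
# the second pass starts half a chunk in, so an answer cut at a boundary in
# pass one is intact in pass two.
words = test_text.split()
chunk_len = 300  # assumed safe size in words; tune so the tokenized length stays < 512
best_score, best_answer = max(
    answer_with_score(question, ' '.join(words[i:i + chunk_len]))
    for offset in (0, chunk_len // 2)
    for i in range(offset, len(words), chunk_len)
)
print(best_answer)
```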
I'm new to the BERT QA model and was trying to follow the example found in this article. The problem is that when I run the code attached to the example, it produces a TypeError as follows: TypeError: argmax(): argument 'input' (position 1) must be Tensor, not str.
Here is the code that I've tried running:
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

#Model
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

#Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

question = '''SAMPLE QUESTION'''
paragraph = '''SAMPLE PARAGRAPH'''

encoding = tokenizer.encode_plus(text=question, text_pair=paragraph, add_special_tokens=True)
inputs = encoding['input_ids']                   #Token embeddings
sentence_embedding = encoding['token_type_ids']  #Segment embeddings
tokens = tokenizer.convert_ids_to_tokens(inputs) #Input tokens

start_scores, end_scores = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
answer = ' '.join(tokens[start_index:end_index+1])
The issue appears at the torch.argmax(start_scores) line, where I'm trying to get the position of the maximum element in start_scores, and the error says that this is not a tensor. When I tried printing this variable, it showed the string "start_logits". Does anyone know a solution to this issue?
After referring to the BERT documentation, we identified that recent versions of the model return an output object with multiple properties, not only the start and end scores. Unpacking that object as a tuple yields its field names as strings, which is why start_scores ended up holding "start_logits". Thus, we applied the following changes to the code.
outputs = model(input_ids=torch.tensor([inputs]),token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)
answer = ' '.join(tokens[start_index:end_index+1])
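As a small aside (my addition, not part of the original answer): WordPiece splits rare words into '##'-prefixed sub-tokens, and BertTokenizer.convert_tokens_to_string reassembles them into readable text:

```
# Re-join WordPiece sub-tokens (e.g. 'pup', '##pet') into a normal string
answer = tokenizer.convert_tokens_to_string(tokens[start_index:end_index + 1])
```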
Always refer to the documentation first :"D
I am writing a Question Answering system using pre-trained BERT with a linear layer and a softmax layer on top. When following the templates available on the net, the labels of one example usually consist of only one answer_start_index and one answer_end_index. For example, from Huggingface, when instantiating a SquadFeatures object:
```
self.input_ids = input_ids
self.attention_mask = attention_mask
self.token_type_ids = token_type_ids
self.cls_index = cls_index
self.p_mask = p_mask
self.example_index = example_index
self.unique_id = unique_id
self.paragraph_len = paragraph_len
self.token_is_max_context = token_is_max_context
self.tokens = tokens
self.token_to_orig_map = token_to_orig_map
self.start_position = start_position
self.end_position = end_position
self.is_impossible = is_impossible
self.qas_id = qas_id
```
However, in my own dataset I have examples where the answer word is found at several locations in the context, i.e. there may be several correct spans constituting the answer.
My problem is that I don't know how to handle such examples. In the templates available on the net, labels are usually in a flat list, say:
[start_example1, start_example2, start_example3]
[end_example1, end_example2, end_example3]
In my case this may look like:
[start_example1, [start_example2_1, start_example2_2], start_example3]
and the same for the end indices, of course.
In other words, I do not have a list containing one label per example, but a list containing either a single label or a list of labels for an example, i.e. a list that contains lists.
When following other templates the next step in the process is:
```
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
token_type_ids = torch.cat(token_type_ids, dim=0)
span_starts = torch.tensor(span_starts) # Something like this
span_ends = torch.tensor(span_ends)     # Something like this
```
However, this of course (?) raises an error, as my span_start and span_end lists do not contain only single items but sometimes a list within the list.
Anyone have an idea on how I can tackle this problem? Should I only use examples where there's only one span constituting the answer present in the context?
If I work around the torch error, will the backpropagation / evaluation / computation of loss still work?
Thank You! /B
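One common way to handle such ragged labels (an editorial sketch, not from the original post; the -1 padding and the min-over-spans loss are my assumptions, and every example is assumed to have at least one gold span) is to pad each example to a fixed number of candidate spans and take, per example, the lowest loss over its gold spans:

```
import torch
import torch.nn.functional as F

def pad_spans(span_lists, max_spans, pad=-1):
    # [start_example1, [start_example2_1, start_example2_2], ...] ->
    # a (batch, max_spans) tensor, padded with -1
    rows = []
    for s in span_lists:
        row = s if isinstance(s, list) else [s]
        row = row[:max_spans]
        rows.append(row + [pad] * (max_spans - len(row)))
    return torch.tensor(rows)

def min_span_loss(start_logits, end_logits, starts, ends, pad=-1):
    # Cross-entropy against every gold (start, end) pair; padded slots get
    # an infinite loss so the per-example min never selects them
    losses = []
    for k in range(starts.size(1)):
        s, e = starts[:, k], ends[:, k]
        valid = s != pad
        l = (F.cross_entropy(start_logits, s.clamp(min=0), reduction='none')
             + F.cross_entropy(end_logits, e.clamp(min=0), reduction='none'))
        losses.append(torch.where(valid, l, torch.full_like(l, float('inf'))))
    # min over candidate spans, mean over the batch
    return torch.stack(losses, dim=1).min(dim=1).values.mean()
```

Because the min simply selects one span's loss per example, backpropagation and loss computation work exactly as in the single-span case; the padded slots never receive gradient.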
Have you checked this code?
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text)
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
assert answer == "a nice puppet"
I am not sure if this is the best way, but instead of argmax you may check topk, and see whether one of those candidates corresponds to the correct answer.
t = torch.LongTensor([0,1,2,3,4,5,6,7,8,9])
_, indices = t.topk(4)
indices  # tensor([9, 8, 7, 6])
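Applied to the QA outputs above, a sketch (my addition; the candidate-filtering rules and best_spans helper are assumptions) could score the top-k start/end combinations and keep the best valid spans:

```
import torch

def best_spans(start_scores, end_scores, k=5, max_len=30):
    # Consider the k highest-scoring start and end positions and rank every
    # combination that forms a valid span (start <= end, not too long)
    s_vals, s_idx = start_scores[0].topk(k)
    e_vals, e_idx = end_scores[0].topk(k)
    spans = []
    for sv, si in zip(s_vals, s_idx):
        for ev, ei in zip(e_vals, e_idx):
            if si <= ei <= si + max_len:
                spans.append((sv.item() + ev.item(), si.item(), ei.item()))
    return sorted(spans, reverse=True)  # (score, start, end), best first

# e.g. inspect the second-best candidate span:
# score, start, end = best_spans(start_scores, end_scores)[1]
# print(' '.join(all_tokens[start:end + 1]))
```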
As the title mentions, how could I train a model to classify whether sentences like the following are logical or illogical?
“He has two legs”–logical
“He has six legs”–illogical
Solutions I tried:
1: Train a classifier with a CNN
I have done this before, and it works very well if you have enough data. The problem is that I do not have a huge data set that comes with “logical” or “illogical” labels for this case.
2: Use a language model
Train a language model (as introduced by gluonnlp) on a data set like wiki, and use it to find the probability of the sentences. If the probability of a sentence is high, mark it as logical, and vice versa. The problem is that the results are not good.
The way I estimate the probability:
def __predict(self):
    lines = self.__text_edit_input.toPlainText().split("\n")
    result = ""
    for line in lines:
        result += str(self.__sentence_prob(line, 10)) + "\n"
    self.__text_edit_output.setPlainText(result)

def __prepare_sentence(self, text, max_len):
    # Copy up to max_len token ids into a zero-initialised (max_len, 1) matrix
    result = mx.nd.zeros([max_len, 1], dtype='float32')
    max_len = min(len(text), max_len)
    i = max(max_len - len(text), 0)
    j = 0
    for index in range(i, max_len):
        result[index][0] = self.__vocab[text[j]]
        j = j + 1
    return result

def __sentence_prob(self, text, max_len):
    hiddens = self.__model.begin_state(1, func=mx.nd.zeros, ctx=self.__context)
    tokens = self.__tokenizer(text)
    data = self.__prepare_sentence(tokens, max_len)
    output, _ = self.__model(data, hiddens)
    prob = 0
    for i in range(max_len):
        total_prob = mx.nd.softmax(output[i][0])
        prob += total_prob[self.__vocab[i]].asscalar()
    return prob / max_len
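One likely problem in __sentence_prob (my observation, not from the original post): total_prob[self.__vocab[i]] looks up the vocabulary with the loop counter rather than with a token. A language model is usually scored by averaging the log-probability that the output at step t assigns to the token that actually appears at step t+1. A sketch under that assumption, with hypothetical names matching the class above:

```
import mxnet as mx

def __sentence_logprob(self, text):
    # Average log-probability of each actual next token under the model
    tokens = self.__tokenizer(text)
    ids = [self.__vocab[t] for t in tokens]
    data = mx.nd.array(ids, dtype='float32').reshape((len(ids), 1))
    hiddens = self.__model.begin_state(1, func=mx.nd.zeros, ctx=self.__context)
    output, _ = self.__model(data, hiddens)
    logprob = 0.0
    for t in range(len(ids) - 1):
        probs = mx.nd.softmax(output[t][0])
        # Probability assigned to the token that actually comes next
        logprob += mx.nd.log(probs[ids[t + 1]] + 1e-12).asscalar()
    return logprob / max(len(ids) - 1, 1)
```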
Possible issues with the language model approach:
1. I do not split the sentences the correct way (I am using jieba to split the Chinese sentences)
2. The vocabulary size is too small/big (tested 10000, 15000 and 30000)
3. The loss is still too high (perplexity around 190) after 50 epochs
4. The sentence length should be larger/smaller (tried 10, 20, 35)
5. The data I use does not meet my requirements (not every sentence is logical)
6. A language model is not appropriate for this task?
Any suggestions? Thanks
Issue 6 ("A language model is not appropriate for this task?") is the main problem. Language models are built to make sense of input text with respect to language usage (syntax, semantics, etc.), not to draw logical conclusions. So you may not get good results even with a large amount of data or very deep models.
The problem you're trying to solve is extremely difficult. Something you may want to look at is Symbolic AI. There's a lot of ongoing research in this area.
I have this code, and I have a list of articles as a data set. Each row contains an article.
I run this code:
import gensim
docgen = TokenGenerator( raw_documents, custom_stop_words )
# the model has 500 dimensions, the minimum document-term frequency is 20
w2v_model = gensim.models.Word2Vec(docgen, size=500, min_count=20, sg=1)
print( "Model has %d terms" % len(w2v_model.wv.vocab) )
w2v_model.save("w2v-model.bin")
# To re-load this model, run
#w2v_model = gensim.models.Word2Vec.load("w2v-model.bin")
def calculate_coherence( w2v_model, term_rankings ):
    overall_coherence = 0.0
    for topic_index in range(len(term_rankings)):
        # check each pair of terms
        pair_scores = []
        for pair in combinations(term_rankings[topic_index], 2):
            pair_scores.append( w2v_model.similarity(pair[0], pair[1]) )
        # get the mean for all pairs in this topic
        topic_score = sum(pair_scores) / len(pair_scores)
        overall_coherence += topic_score
    # get the mean score across all topics
    return overall_coherence / len(term_rankings)
import numpy as np
def get_descriptor( all_terms, H, topic_index, top ):
    # reverse sort the values to sort the indices
    top_indices = np.argsort( H[topic_index,:] )[::-1]
    # now get the terms corresponding to the top-ranked indices
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append( all_terms[term_index] )
    return top_terms
from itertools import combinations
k_values = []
coherences = []
for (k,W,H) in topic_models:
    # Get all of the topic descriptors - the term_rankings, based on top 10 terms
    term_rankings = []
    for topic_index in range(k):
        term_rankings.append( get_descriptor( terms, H, topic_index, 10 ) )
    # Now calculate the coherence based on our Word2vec model
    k_values.append( k )
    coherences.append( calculate_coherence( w2v_model, term_rankings ) )
    print("K=%02d: Coherence=%.4f" % ( k, coherences[-1] ) )
I get this error:
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: u"word 'business' not in vocabulary"
The original code works great with their data set:
https://github.com/derekgreene/topic-model-tutorial
Could you help me understand what this error means?
It could help answerers if you included more of the information around the error message, such as the multiple-lines of call-frames that will clearly indicate which line of your code triggered the error.
However, if you receive the error KeyError: u"word 'business' not in vocabulary", you can trust that your Word2Vec instance, w2v_model, never learned the word 'business'.
This might be because it didn't appear in the training data the model was presented, or perhaps appeared but fewer than min_count times.
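A quick way to check the min_count possibility (a sketch of my own, assuming docgen can be iterated more than once and yields lists of tokens):

```
from collections import Counter

# Tally every token the Word2Vec model would see during training
counts = Counter(token for doc in docgen for token in doc)
print(counts['business'])  # must be >= min_count (20) to enter the vocabulary
```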
As you don't show the type/contents of your raw_documents variable, or code for your TokenGenerator class, it's not clear why this would have gone wrong – but those are the places to look. Double-check that raw_documents has the right contents, and that individual items inside the docgen iterable-object look like the right sort of input for Word2Vec.
Each item in the docgen iterable object should be a list of string tokens, not a plain string or anything else. And the docgen iterable must be capable of being iterated over multiple times. For example, if you execute the following two lines, you should see the same list of string tokens printed twice (looking something like ['hello', 'world']):
print(next(iter(docgen)))
print(next(iter(docgen)))
If you see plain strings, docgen isn't providing the right kind of items for Word2Vec. If you only see one item printed, docgen is likely a simple single-pass iterator, rather than an iterable object.
You could also enable logging at the INFO level and watch the output during the Word2Vec step carefully, and pay extra attention to any numbers/steps that seem incongruous. (For example, do any steps indicate nothing is happening, or do the counts of words/text-examples seem off?)
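For example (my addition, using the standard Python logging setup that gensim's documentation recommends):

```
import logging

# Show gensim's INFO-level progress output: vocabulary counts, epoch
# progress, and final token/vector tallies
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)
```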
I am implementing a Naive Bayes classifier with NLTK. But when I train the classifier with the extracted features, it gives the error "too many values to unpack". I am just a beginner in Python. Here is the code. The program reads text from files and extracts features from them.
import nltk.classify.util,os,sys;
from nltk.classify import NaiveBayesClassifier;
from nltk.corpus import stopwords;
from nltk.tokenize import word_tokenize,RegexpTokenizer;
import re;

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def word_feats(words):
    return dict([(word,True) for word in words])

def feature_extractor(sentiment):
    path = "train/"+sentiment+"/"
    files = os.listdir(path);
    feats = {};
    i = 0;
    for file in files:
        f = open(path+file,"r", encoding='utf-8');
        review = f.read();
        review = remove_tags(review);
        stopWords = (stopwords.words("english"))
        tokenizer = RegexpTokenizer(r"\w+");
        tokens = tokenizer.tokenize(review);
        features = word_feats(tokens);
        feats.update(features)
    return feats;

posative_feat = feature_extractor("pos");
p = open("posFeat.txt","w", encoding='utf-8');
p.write(str(posative_feat));

negative_feat = feature_extractor("neg");
n = open("negFeat.txt","w", encoding='utf-8');
n.write(str(negative_feat));

plength = int(len(posative_feat)*3/4);
nlength = int(len(negative_feat)*3/4)
totalLength = plength+nlength;

trainFeatList = {}
testFeatList = {}

i = 0
for items in posative_feat.items():
    i +=1;
    value = {items[0]:items[1]}
    if(i<plength):
        trainFeatList.update(value);
    else:
        testFeatList.update(value);

j = 0
for items in negative_feat.items():
    j +=1;
    value = {items[0]:items[1]}
    if(j<plength):
        trainFeatList.update(value);
    else:
        testFeatList.update(value);
    classifier = NaiveBayesClassifier.train(trainFeatList)
    print(nltk.classify.util.accuracy(classifier,testFeatList));
    classifier.show_most_informative_features();
Looking at the NLTK book page http://www.nltk.org/book/ch06.html, it seems the data that is given to the NaiveBayesClassifier is of the type list(tuple(dict,str)), whereas the data you are passing to the classifier is a single dict of features.
If you represent the data in that manner, you will get the expected behaviour. Basically, it is a list of (feature dict, label) pairs.
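For instance (my illustration, with made-up words and labels), the training data should look like:

```
# Each training item is a (feature dict, label) tuple
train_data = [
    ({'great': True, 'fun': True}, 'pos'),
    ({'boring': True, 'awful': True}, 'neg'),
]
classifier = NaiveBayesClassifier.train(train_data)
```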
There are multiple errors in your code:
1. Python does not use a semicolon as a line ending
2. The True boolean in word_feats does not seem to serve a purpose
3. trainFeatList and testFeatList should be lists
4. Each value in your feature items list should be a tuple(dict, str)
5. Assign labels to the features in the list (see point 4)
6. Take NaiveBayesClassifier, and any use of classifier, out of the negative-features loop
If you fix the previous errors, the classifier will run, but without knowing what you are trying to achieve, I find the setup confusing, and it does not predict well.
The main line you need to pay attention to is where you assign something to your variable value.
for example:
value = {items[0]:items[1]}
should be something like:
value = ({feature_name:feature}, label)
Then, afterwards, you would call .append() on your lists to add each value, instead of .update().
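Applied to your positive-features loop, the fix could look like this (a sketch; the 'pos' label is my assumption for that loop):

```
trainFeatList = []
testFeatList = []

i = 0
for word, present in posative_feat.items():
    i += 1
    value = ({word: present}, 'pos')  # (feature dict, label)
    if i < plength:
        trainFeatList.append(value)
    else:
        testFeatList.append(value)
```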
You can look at an example of your updated code, in a buggy but working state, at http://pastebin.com/91Zu59Cm but I would suggest thinking about the following:
How is the data supposed to be represented for the NaiveBayesClassifier class?
What features are you trying to capture?
What labels are associated with those features?