Adding domain specific vocabulary to Huggingface pretrained tokenizer - python

I am using the Hugging Face transformers package and am loading a pretrained OpenAIGPTTokenizer from a vocab file and a merges file.
I want to add some domain-specific words to the vocabulary of this pretrained tokenizer (words such as "sms", "idk" and "lol").
When I tokenise a word that is currently in the vocabulary, the resulting token has a </w> symbol appended which I assume means end-of-word.
>>> tokenizer.tokenize("hello there lol")
["hello</w>", "there</w>", "lo", "l</w>"]
However, if I add a token to the tokenizer, the added token doesn't get a </w> symbol at the end, although the tokenizer does at least recognise "lol" as its own token.
>>> tokenizer.tokenize("hello there lol")
["hello</w>", "there</w>", "lol"]
I don't think this is the correct way to go about extending the vocabulary. What should I do?
Thanks

Related

Word2Vec empty word not in vocabulary

I'm currently required to work on a multilingual text classification model where I have to classify whether two sentences in two languages are semantically similar. I'm also required to use Word2Vec for word embedding.
I am able to generate the word embeddings using Word2Vec. However, when I try to convert my sentences to vectors with a method similar to this, I get an error saying
KeyError: "word '' not in vocabulary"
Here is my code snippet
import re
import nltk
from gensim.models import Word2Vec
nltk.download('punkt')
tokenized_text_data = [nltk.word_tokenize(sub) for sub in concatenated_text]
model = Word2Vec(sentences=tokenized_text_data, min_count=1)
# Error happens here
train_vectors = [model.wv[re.split(" |;", row)] for row in concatenated_text]
For context, concatenated_text holds the sentences from the two languages concatenated together, with a semicolon as the delimiter. Hence the call to re.split(" |;").
I guess the important thing now is to understand why the error is telling me that an empty string '' is not in the vocabulary.
I did not provide the sentences because the dataset is too big and I can't seem to find which word of which sentence is producing this error.
It turns out the cause was the delimiter I had added myself. There are other semicolons in the sentence dataset, and because of how re.split(" |;") works, a sentence such as ice cream ; bread ; milk is split into ['ice', 'cream', '', '', 'bread', '', '', 'milk']: wherever a space and a semicolon are adjacent, the split produces an empty string between them. Hence the error word '' not in vocabulary.
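A minimal sketch of the behaviour described above, and of one possible way around it, assuming it is acceptable to simply drop the empty strings. The helper name split_row is hypothetical and not part of the original code:

```python
import re

def split_row(row):
    """Split on single spaces or semicolons, then drop empty strings."""
    return [tok for tok in re.split(r" |;", row) if tok]

row = "ice cream ; bread ; milk"

# Adjacent delimiters (" ;" and "; ") leave empty strings in the result.
print(re.split(r" |;", row))

# Filtering removes them before the tokens are looked up in the model.
print(split_row(row))
```

Note that tokens produced this way may still differ from those produced by nltk.word_tokenize, which is what the model in the question was trained on, so other lookups could still fail.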
I hope this would benefit someone in the future!

Using BERT to extract most similar words instead of word2vec for labeling functions

I am fairly new to BERT, and I wanted to test both approaches, word2vec and BERT, for extracting the most similar words to a given word, which I then pattern match in my labeling functions.
I am currently using Snorkel; one labeling function looks like this:
@labeling_function()
def lf_find_good_synonyms(x):
    good_synonyms = word_vectors.most_similar("good", topn=25)
    good_list = syn_list(good_synonyms)
    return POSITIVE if any(word in x.stemmed for word in good_list) else ABSTAIN
This function basically looks for the word "good" or any of its similar words in a sentence (the sentences are stemmed, and so are the words, as the function syn_list returns the stem of each similar word). If one is found, the function simply labels the sentence as POSITIVE.
The issue here is that my word vectors are based on word2vec, which is an old approach. I was wondering whether I could use BERT instead, and whether it would improve performance much, given that labeling functions are allowed to be lousy?

KeyError: "word not in vocabulary" in word2vec

I wanted to convert some Japanese words to vectors so that I can train a model for prediction. For that I downloaded pretrained models from Here.
import gensim
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import KeyedVectors
from gensim import models
from janome.tokenizer import Tokenizer
w2v_model = models.KeyedVectors.load_word2vec_format(w2v_models_path)
t = Tokenizer()
# I am testing for some random string
sentence = "社名公開求人住宅手当・家賃補助制度がある企業在宅勤務・リモートワーク可能な求人テレワークコロナに負けるな!積極採用中の企業特集リモートワーク可能なWebデザイナー求人"
tokens = [x.surface for x in t.tokenize(sentence)]
vectors = [w2v_model[v] for v in tokens]
In the last line, I am getting KeyError: "word 'テレワークコロナ' not in vocabulary"
Is there anything wrong here?
If you get a "not in vocabulary" error, you can trust that the token (word/key) that you've requested isn't in that KeyedVectors model.
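You can see the full list of words known to your model (in the order they are stored) with w2v_model.index_to_key (the Gensim 4.x name; in Gensim 3.x the equivalent list is w2v_model.index2word). Or just take a quick peek at a range of 20 in the middle as a sanity check, using Python slicing like w2v_model.index_to_key[500:520]. Note that w2v_model.key_to_index is a dict mapping each word to its position, so it cannot be sliced this way.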
Are you sure 'テレワークコロナ' (and any other string giving the same error) is a legitimate, common Japanese word? Might the tokenizer be failing in some way? Are most of the words the tokenizer returns in the model?
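To answer the last question in practice, a rough sketch like the following could report how many tokenizer outputs the model actually knows. The helper split_known_unknown and the toy vocabulary are assumptions made for illustration; with Gensim you would pass the loaded KeyedVectors object (w2v_model in the question), which supports the in operator:

```python
def split_known_unknown(tokens, vocab):
    """Partition tokens by membership in vocab (anything supporting `in`)."""
    known = [tok for tok in tokens if tok in vocab]
    unknown = [tok for tok in tokens if tok not in vocab]
    return known, unknown

# Stand-in vocabulary for illustration only, not taken from the real model.
toy_vocab = {"求人": 0, "企業": 1, "リモートワーク": 2}
tokens = ["求人", "テレワークコロナ", "企業"]

known, unknown = split_known_unknown(tokens, toy_vocab)
print(f"{len(known)}/{len(tokens)} tokens found; missing: {unknown}")
```

If most tokens are missing, the tokenizer and the model probably disagree on segmentation; if only a few are, skipping the unknown tokens may be enough.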
It looks like the site you've linked has just copied those sets-of-word-vectors from Facebook's FastText work (https://fasttext.cc/docs/en/crawl-vectors.html). And, you're just using the plain text "word2vec_format" lists of vectors, so you only have the exact words in that file, and not the full FastText model - which also models word-fragments, and can thus 'guess' vectors for unknown words. (These guesses aren't very good – like working out a word's possible meaning from word-roots – but are usually better than nothing.)
I don't know if that approach works well for Japanese, but you could try it: if you instead grab the .bin (rather than text) file and load it using Gensim's FastText support, specifically the load_facebook_vectors() method, you'll get a special kind of KeyedVectors (FastTextKeyedVectors) that will give you such guesses for unknown words, which might help for your purposes (or not).

Word2Vec vocab results in just letters and symbols

I'm new to Word2Vec and I am trying to cluster words based on their similarity. To start I am using nltk to separate the sentences and then using the resulting list of sentences as the input into Word2Vec. However, when I print the vocab, it is just a bunch of letters, numbers and symbols rather than words. To be specific, an example of one of the letters is "< gensim.models.keyedvectors.Vocab object at 0x00000238145AB438>, 'L':"
# imports needed and logging
import gensim
from gensim.models import word2vec
import logging
import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
with open('C:\\Users\\Freddy\\Desktop\\Thesis\\Descriptions.txt','r') as f_open:
    text = f_open.read()
arr = []
sentences = nltk.sent_tokenize(text) # this gives a list of sentences
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)
model = word2vec.Word2Vec(sentences, size = 300)
print(model.wv.vocab)
As the tutorial and the documentation for the Word2Vec class suggest, the constructor requires a list of lists of words as the first parameter (or, more generally, an iterator of iterators of words):
sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora,...
I believe that before feeding the sentences into Word2Vec you need to apply word_tokenize to each of them, changing the crucial line to:
sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text)]
TL;DR
You get letters as your "words" because Word2Vec treats strings corresponding to sentences as iterables containing words. Iterating over strings results in the sequence of letters. These letters are used as the basis for the model learning (instead of intended words).
As the ancient saying goes: trash in - trash out.
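The effect can be reproduced without Gensim. The sketch below is only an illustration of the iteration behaviour: build_vocab is a hypothetical stand-in, not Gensim's implementation, and str.split is used in place of nltk.word_tokenize to keep the example dependency-free:

```python
from collections import Counter

def build_vocab(sentences):
    """Count items the way a consumer of 'sentences of tokens' sees them."""
    counts = Counter()
    for sentence in sentences:   # each sentence is iterated...
        for token in sentence:   # ...and every item is treated as a "word"
            counts[token] += 1
    return counts

raw = ["The cat sat.", "The dog ran."]

as_strings = build_vocab(raw)                      # strings yield characters
as_tokens = build_vocab([s.split() for s in raw])  # lists yield words

print(sorted(as_strings)[:5])
print(sorted(as_tokens))
```

Passing plain strings gives a vocabulary of single characters, while passing lists of tokens gives a vocabulary of words.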

Statistical sentence suggestion model like spell checking

Spell-checking models already exist that suggest correct spellings based on a corpus of known-correct spellings. Can the granularity be increased from letters to words, so that we get phrase suggestions? That is, if an incorrect phrase is entered, the model should suggest the nearest correct phrase from a corpus of valid phrases on which it was trained.
Are there any Python libraries that already provide this functionality? If not, how should I proceed, given an existing large gold-standard phrase corpus, to get statistically relevant suggestions?
Note: this differs from a spell checker in that a spell checker's alphabet is finite, whereas in a phrase corrector each symbol of the "alphabet" is itself a word, so the alphabet is theoretically infinite; we can, however, limit the number of words using a phrase bank.
What you want to build is an N-gram model, which consists of computing the probability of each word following a sequence of n-1 words.
You can use NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
You can consider a 2-gram model (a first-order Markov model):
What is the probability for "kitten" to follow "cute"?
...or 3-gram:
What is the probability for "kitten" to follow "the cute"?
etc.
Obviously, training an (n+1)-gram model is costlier than training an n-gram model.
Instead of considering words, you can consider pairs (word, pos), where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens)).
You can also try using lemmas instead of words.
Here are some interesting lectures on N-gram modelling:
Introduction to N-grams
Estimating N-gram Probabilities
This is a short, simple, unoptimized code example (2-gram):
from collections import defaultdict
import nltk
import math
# ngram[w1][w2] counts how often w2 directly follows w1
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    # Build a list (not a lazy map object) so that it can be sliced below
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
# Convert the counts to log10 conditional probabilities P(next | token)
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
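As a rough sketch of how such a 2-gram model might be used for phrase suggestion, the example below scores candidate phrases by summing bigram log-probabilities and picks the best one. The function names, the toy corpus, and the fixed penalty for unseen word pairs are all assumptions made for illustration (a real system would use proper smoothing), and str.split stands in for the NLTK tokenizers:

```python
import math
from collections import defaultdict

def train_bigrams(sentences):
    """Return log10 P(next | word) estimated from tokenized sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for word, nxt in zip(tokens, tokens[1:]):
            counts[word][nxt] += 1
    return {
        word: {nxt: math.log10(c / sum(followers.values()))
               for nxt, c in followers.items()}
        for word, followers in counts.items()
    }

def phrase_score(model, tokens, unseen=-6.0):
    """Sum bigram log-probabilities; unseen pairs get an arbitrary penalty."""
    return sum(model.get(w, {}).get(n, unseen)
               for w, n in zip(tokens, tokens[1:]))

# Toy corpus of valid phrases, for illustration only.
corpus = [s.lower().split() for s in
          ["the cat is cute", "he jumps and he is happy", "the cat is happy"]]
model = train_bigrams(corpus)

# Rank candidate phrases; the highest score is the suggested phrase.
candidates = [["the", "cat", "is", "happy"], ["the", "happy", "is", "cat"]]
best = max(candidates, key=lambda c: phrase_score(model, c))
print(" ".join(best))
```

How the candidate phrases are generated (for example, from a phrase bank of near matches) is left open here; the model only ranks them.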
