Add new words to GoogleNews word vectors with gensim - Python

I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors via the gensim library. But my corpus contains some words that are not in the GoogleNews vocabulary. For these missing words, I want to use the arithmetic mean of the n most similar words found in the GoogleNews vocabulary. First I load GoogleNews and check that the word "to" is in it:
# Load the GoogleNews pretrained word2vec model
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(model["to"])
I receive an error: KeyError: "word 'to' not in vocabulary"
Is it possible that such a large dataset doesn't have this word? The same is true for some other common words like "a"!
To add the missing words to the word2vec model, I first get the indices of the words that are in GoogleNews; for missing words I have used index 0.
# obtain index of words
from collections import OrderedDict
word_to_idx = OrderedDict({w: 0 for w in corpus_words})
word_to_idx = OrderedDict({w: model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})
Then I calculate the mean of the embedding vectors of the most similar words to each missing word:
import numpy as np

missing_embd = {}
for key, value in word_to_idx.items():
    if value == 0:
        similar_words = model.wv.most_similar(key)
        similar_embeddings = [model.wv[a[0]] for a in similar_words]
        missing_embd[key] = np.mean(similar_embeddings, axis=0)
And then I add these new embeddings to the word2vec model by:
for word, embd in missing_embd.items():
    # model.wv.build_vocab(word, update=True)
    model.wv.syn0[model.wv.vocab[word].index] = embd
There is an inconsistency: when I print missing_embd, it is empty, as if there were no missing words at all.
But when I check it with this:
for w in tokens_lower:
    if (w in model.wv.vocab) == False:
        print(w)
        print("***********")
I found a lot of missing words.
Now, I have 3 questions:
1- Why is missing_embd empty while there are missing words?
2- Is it possible that GoogleNews doesn't have words like "to"?
3- How can I append new embeddings to the word2vec model? I used build_vocab and syn0. Thanks.

Here is a scenario where we are adding a missing lower case word.
from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)
'Quoran' in embedding.vocab
Output : True
'quoran' in embedding.vocab
Output : False
Here 'Quoran' is present, but its lower-case form 'quoran' is missing.
# add quoran in lower case
embedding.add('quoran',embedding.get_vector('Quoran'),replace=False)
'quoran' in embedding.vocab
Output : True
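As a side note for readers on gensim 4.0 or newer: the attribute and method names used above were renamed, so a rough equivalent (a sketch assuming the same GoogleNews file and path) would be:
from gensim.models import KeyedVectors

embedding = KeyedVectors.load_word2vec_format(path, binary=True)

# vocabulary membership is checked via key_to_index instead of .vocab
'quoran' in embedding.key_to_index            # False before adding

# add_vector() replaces the older add() method
embedding.add_vector('quoran', embedding.get_vector('Quoran'))

'quoran' in embedding.key_to_index            # True after adding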

It's possible Google removed common filler words like 'to' and 'a'. If the file seems otherwise uncorrupted, and other words check out as present after loading, it's reasonable to assume Google discarded the overly-common words as having such diffuse meaning as to be of low value.
It's unclear what you're trying to do. You assign to word_to_idx twice, so only the second line matters.
(The first assignment, creating a dict where all words have a 0 value, has no lingering effect after the 2nd line creates an all-new dict, with only entries where w in model.wv.vocab. The only possible entry with a 0 after this step would be whatever word in the word-vectors set was already in position 0 – if and only if that word was also in your corpus_words.)
You seem to want to build new vectors for unknown words based on an average of similar words. However, most_similar() only works for known words; it will error if tried on a completely unknown word. So that approach can't work.
And a deeper problem is the gensim KeyedVectors class doesn't have support for dynamically adding new word->vector entries. You would have to dig into its source code and, to add one or a batch of new vectors, modify a bunch of its internal properties (including its vectors array, vocab dict, and index2entity list) in a self-consistent manner to have new entries.

Related

Break down text into units of sense - text segmentation NLP Python

I have a dataframe text column (in French) and I want to split each text into sentences by their meaning (break the text down into units of sense). Any idea how to do it with Python libraries and NLP techniques?
P.S. I tried NLTK sent_tokenize and word_tokenize, but they don't split well with respect to meaning.
For example:
“text discussing sports, then economics, then school systems”
=> I want to break down the text into sentences like this:
sports-related text
economics-related text
school-system-related text
Or at least extract tags from the whole text, so for this example I'd have the following tags:
sports/economics/school.
If I could achieve either of these two cases it would be great.
Unfortunately, to my knowledge, there is no straightforward answer to that.
However, there might be some workarounds. What I suggest is to apply a sentence transformer to the list of phrases that you have, something like this:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sentences we want to encode. Example:
sentence1 = ['This framework generates embeddings for each input sentence']
sentence2 = ['This is another phrase']

# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence1 + sentence2)
The result is a list of dense vectors, one for each of the sentences.
There is a list of different transformer models at https://huggingface.co/sentence-transformers; they differ in the languages they cover, their number of parameters, and their performance.
After that I would set up a list of words to be used as tags; these may be arbitrarily chosen, or taken from the corpus itself based on the top N most frequent words (some pre-processing of the text might be necessary in this case).
With the NLTK library it should look like this:
import nltk

words = nltk.tokenize.word_tokenize(corpus)
stopwords = nltk.corpus.stopwords.words('english')  # remove stopwords
common_words_without_stopwords = nltk.FreqDist(w.lower() for w in words if w not in stopwords)
# top 10 most common words, which you can use as tags
mostCommon = [word for word, count in common_words_without_stopwords.most_common(10)]
Then I would convert those tags into vectors as well, and loop over the phrases to identify, for each phrase, the top N tags that are closest to it.
As for measuring the distance between two dense vectors, there are different options (Euclidean distance, cosine similarity, Lp norms...); it's a matter of choice.
For example, with Euclidean distance:
import numpy as np

# calculating Euclidean distance using linalg.norm()
dist = np.linalg.norm(embedding[0] - embedding[1])
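And as a minimal sketch of the tag-matching loop with cosine similarity (assuming phrases is the list of input phrases, tags the list of candidate tag words, and phrase_embeddings / tag_embeddings their encodings from model.encode(); these names are illustrative, not from the question):
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

top_n = 3
for phrase, phrase_vec in zip(phrases, phrase_embeddings):
    # score every candidate tag against this phrase and keep the best ones
    scores = [(tag, cosine(phrase_vec, tag_vec)) for tag, tag_vec in zip(tags, tag_embeddings)]
    best_tags = sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]
    print(phrase, '->', best_tags)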

How can I train a model with gensim lib and wikipedia2vec txt?

I'm trying to classify a dataset of files. In this dataset I have a column of texts and a column of labels. I want to build a model based on a Wikipedia corpus, but I'm a little lost in the middle.
What I did so far...
I preprocessed my Text column (removing stopwords, whitespace, accents, etc.) and saved it to a new csv file. Then I tagged it using the gensim lib:
import gensim
import smart_open

def apply_preprocessing(fname, tokens_only=False):
    with smart_open.open(fname) as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

preprocessed_dataset = '/content/drive/MyDrive/dataset/preprocessed_dataset.csv'
preprocessed_dataset_train = list(apply_preprocessing(preprocessed_dataset))
This gives me, for each document in my preprocessed_dataset, an array of the words contained in its Text column.
I know that by doing this loop I can get each array of words:
for doc_id in range(len(preprocessed_dataset_train)):
    preprocessed_dataset_train[doc_id].words
My goal is to give it those words and "say": based on these pretrained Wikipedia embeddings (https://wikipedia2vec.github.io/wikipedia2vec/pretrained/), how similar is one doc to another, judging by what you learned from the Wikipedia corpus?
How do I use this pretrained Wikipedia file? It is already a file of word vectors, right? If so, how can I use it to analyse my preprocessed_dataset_train?
What's the next step I should take/understand to reach my goal?
I'm sorry for so many questions, when I think I'm understanding the road, I'm lost again and again.
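A minimal sketch of how a word2vec-format text file is typically loaded with gensim (the filename below is a placeholder, and this assumes the downloaded file follows the standard word2vec text format with a header line; if it also contains multi-word entity entries, those may need filtering or the wikipedia2vec library's own loading tools):
from gensim.models import KeyedVectors

# placeholder filename for the downloaded wikipedia2vec text-format vectors
wiki_vectors = KeyedVectors.load_word2vec_format('enwiki_300d.txt', binary=False)
print(wiki_vectors.most_similar('computer'))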

Extract token frequencies from gensim model

Questions like 1 and 2 give answers for retrieving vocabulary frequencies from gensim word2vec models.
For some reason, they actually just give a counter descending from n (the vocabulary size) down to 1, alongside the tokens ordered from most to least frequent.
For example:
for idx, w in enumerate(model.vocab):
    print(idx, w, model.vocab[w].count)
Gives:
0 </s> 111051
1 . 111050
2 , 111049
3 the 111048
4 of 111047
...
111050 tokiwa 2
111051 muzorewa 1
Why is it doing this? How can I extract term frequencies from the model, given a word?
Those answers are correct for reading the declared token-counts out of a model which has them.
But in some cases, your model may only have been initialized with fake, descending-by-1 counts for each word. This is most likely, when using Gensim, if the model was loaded from a source where the counts either weren't available or weren't used.
In particular, if you created the model using load_word2vec_format(), that simple vectors-only format (whether binary or plain-text) inherently contains no word counts. But the words in such files are almost always, by convention, sorted in most-frequent to least-frequent order.
So, Gensim has chosen, when frequencies are not present, to synthesize fake counts, with linearly descending int values, where the (first) most-frequent word begins with the count of all unique words, and the (last) least-frequent word has a count of 1.
(I'm not sure this is a good idea, but Gensim's been doing it for a while, and it ensures code relying on the per-token count won't break, and will preserve the original order, though obviously not the unknowable original true-proportions.)
In some cases, the original source of the file may have saved a separate .vocab file with the word-frequencies alongside the word2vec_format vectors. (In Google's original word2vec.c code release, this is the file generated by the optional -save-vocab flag. In Gensim's .save_word2vec_format() method, the optional fvocab parameter can be used to generate this side file.)
If so, that 'vocab' frequencies filename may be supplied, when you call .load_word2vec_format(), as the fvocab parameter - and then your vector-set will have true counts.
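A minimal sketch of supplying that side file (the filenames here are placeholders):
from gensim.models import KeyedVectors

# 'vectors.bin' / 'vectors.vocab' stand in for your word2vec-format vectors
# and the matching word-frequency file saved alongside them
kv = KeyedVectors.load_word2vec_format('vectors.bin', fvocab='vectors.vocab', binary=True)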
If your word-vectors were originally created in Gensim from a corpus giving actual frequencies, and were always saved/loaded using the Gensim native functions .save()/.load() (which use an extended form of Python pickling), then the original true count info will never have been lost.
If you've lost the original frequency data, but you know the data was from a real natural-language source, and you want a more realistic (but still faked) set of frequencies, an option could be to use the Zipfian distribution. (Real natural-language usage frequencies tend to roughly fit this 'tall head, long tail' distribution.) A formula for creating such more-realistic dummy counts is available in the answer:
Gensim: Any chance to get word frequency in Word2Vec format?
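As an illustrative sketch only (not the exact formula from that linked answer), a simple Zipf-style assignment for a pre-4.0 Gensim KeyedVectors that exposes .vocab, with an arbitrary scaling constant, might look like:
# assign rank-based, Zipf-like dummy counts: count proportional to 1 / (rank + 1)
for rank, word in enumerate(model.index2word):
    model.vocab[word].count = int(1000000 / (rank + 1))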

Sentence matching with gensim word2vec: manually populated model doesn't work

I'm trying to solve a sentence-comparison problem using the naive approach of summing up word vectors and comparing the results. My goal is to match people by interest, so the dataset consists of names and short sentences describing their hobbies. The batches are fairly small, a few hundred people, so I wanted to give it a try before digging into doc2vec.
I prepare the data by cleaning it completely, removing stop words, tokenizing and lemmatizing. I use a pre-trained model for the word vectors, which returns adequate results when finding similarities for some test words. I also tried summing up the sentence words to find similarities in the original model - the matches do make sense, and the similarities reflect the general sense of the phrase.
For sentence matching I'm trying the following: create an empty model
b = gs.models.Word2Vec(min_count=1, size=300, sample=0, hs=0)
Build vocab out of names (or person id's), no training
# first create vocab with an empty vector
test = [['test']]
b.build_vocab(test)
b.wv.syn0[b.wv.vocab['test'].index] = b.wv.syn0[b.wv.vocab['test'].index] * 0

# populate vocab from an array
b.build_vocab([personIds], update=True)
Sum each sentence's word vectors and store the result in the model for the corresponding id:
# sentences are pulled from pandas dataset df. 'a' is a pre-trained model I use to get vectors for each word
def summ(phrase, start_model):
    '''
    vector addition function
    '''
    # starting with a vector of 0's
    sum_vec = start_model.word_vec("cat_NOUN") * 0
    for word in phrase:
        sum_vec += start_model.word_vec(word)
    return sum_vec

for i, row in df.iterrows():
    try:
        personId = row["ID"]
        summVec = summ(df.iloc[i, 1], a)
        # updating syn0 for each name/id in vocabulary
        b.wv.syn0[b.wv.vocab[personId].index] = summVec
    except:
        pass
I understand that I shouldn't be expecting much accuracy here, but the t-SNE plot doesn't show any clustering whatsoever. The find-similarities method also fails to find matches (a <0.2 similarity coefficient for basically everything).
Wondering if anyone has an idea of where I went wrong? Is my approach valid at all?
Your code, as shown, neither does any train() of word-vectors (using your local text), nor does it pre-load any vectors from elsewhere. So any vectors which do exist – created by the build_vocab() calls – will still just be in their randomly-initialized starting locations, and be useless for any semantic purposes.
Suggestions:
either (a) train your own vectors from your text, which makes sense if you have a good quantity of text; or (b) load vectors from elsewhere. But don't try to do both. (Or, in the case of the code above, neither.)
The update=True option for build_vocab() should be considered an expert, experimental option – only worth tinkering with if you've already had things working in simpler modes, and you're sure you need it and understand all the implications.
Normal use won't ever explicitly re-assign new values into the Word2Vec model's syn0 property - those are managed by the class's training routines, so you never need to zero them out or modify them. You should tally up your own text summary vectors, based on word-vectors, outside the model in your own data structures.
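For instance, a minimal sketch of that last suggestion (assuming pre-4.0 gensim to match the .vocab/.syn0 usage above, with 'a' as the pre-trained vectors and df as the dataframe from the question):
import numpy as np

# build an id -> summed-vector map outside any Word2Vec model
person_vecs = {}
for i, row in df.iterrows():
    known_words = [w for w in df.iloc[i, 1] if w in a.vocab]  # skip out-of-vocabulary words
    if known_words:
        person_vecs[row["ID"]] = np.sum([a.word_vec(w) for w in known_words], axis=0)

def cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# compare any two people directly, no model surgery needed:
# similarity = cosine(person_vecs[id1], person_vecs[id2])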

Count vectorizing into bigrams for one document, and then taking the average

I'm trying to write a function that takes in one document and count-vectorizes the bigrams for that document. This shouldn't have any zeroes, as I'm only doing this to one document at a time. Then I want to take the average of those numbers to get a sense of bigram repetition.
Any problems with this code?
from sklearn.feature_extraction.text import CountVectorizer

def avg_bigram(x):
    bigram_vectorizer = CountVectorizer(stop_words='english', ngram_range=(2, 2))
    model = bigram_vectorizer.fit_transform(x)
    vector = model.toarray()
    return vector.mean()
I've tested it with text that I know contains more than stop words, and I get back
"empty vocabulary; perhaps the documents only contain stop words"
Thank you for any help!
CountVectorizer expects a corpus (an iterable of documents), while you are giving it a single doc. Just wrap your doc in a list, e.g.:
model = bigram_vectorizer.fit_transform([x])
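With that one change, a quick sanity check of the corrected function (the example sentence below is made up for illustration):
from sklearn.feature_extraction.text import CountVectorizer

def avg_bigram(x):
    bigram_vectorizer = CountVectorizer(stop_words='english', ngram_range=(2, 2))
    counts = bigram_vectorizer.fit_transform([x])  # wrap the single document in a list
    return counts.toarray().mean()

# repeated bigrams push the average above 1.0
print(avg_bigram("the quick brown fox jumps over the lazy dog and the quick brown fox returns"))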
