I'm starting to get familiar with Word2Vec, but I'm struggling with a problem and couldn't find anything similar...
I want to use gensim's Word2Vec on an imported PDF document (a book). To import it I used PyPDF2 and stored the whole book in a list. Furthermore, I used gensim's simple_preprocess to preprocess the data. This worked so far; I got the following output:
text=['schottky','diode','semiconductors',...]
So then I tried to use Word2Vec:
from gensim.models import Word2Vec
model=Word2Vec(text, size=100, window=5, min_count=5, workers=4)
words=list(model.wv.vocab)
but the output was like this:
print(words)
['c','h','t','k','d',...]
I expected the same words as in the text list, not just single characters. When I tried to find relations between words (e.g. 'schottky' and 'diode') I got an error message saying that none of these words is in the vocabulary.
My first thought was that the import was wrong, but I got the same result with textract instead of PyPDF2.
Does someone know what's the problem? Thanks for your help!
Appendix:
Importing the book
import os
import PyPDF2

content_text=[]
number_of_inputs=len(os.listdir(path))
file_to_open=path
open_file=open(file_to_open,'rb')
read_pdf=PyPDF2.PdfFileReader(open_file)
number_of_pages=read_pdf.getNumPages()
page_content=""
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content += page.extractText()
content_text.append(page_content)
Word2Vec requires as its sentences parameter a training corpus that is:
an iterable sequence (such as a list)
where each item is itself a list of string-tokens
If you supply just a list-of-strings, each string is seen as a list-of-one-character-strings, resulting in all the one-letter words you're seeing.
So, use a list-of-lists-of-words, more like:
[
    ['schottky', 'diode', 'semiconductors'],
]
(Note also that you generally won't get interesting Word2Vec results on tiny toy-sized data sets of just a few texts and just dozens to hundreds of words. You need many thousands of unique words, across many dozens of contrasting examples of each word, to induce the useful word-vector arrangements that Word2Vec is known for.)
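For example, here is a minimal sketch of one way to build such a corpus from the extracted PDF text, assuming content_text is the list of extracted strings from the appendix above (variable names and the older-gensim parameters are reused from the question's code):
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

# one list of tokens per extracted string, giving the list-of-lists Word2Vec expects
sentences = [simple_preprocess(chunk) for chunk in content_text]

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
print(list(model.wv.vocab)[:10])  # should now show whole words, not single characters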
Instead of
text=['schottky','diode','semiconductors']
Use this
text=[['schottky','diode','semiconductors']]
More info: Gensim word2vec
I'm trying to get the 5 most frequent words from a word2vec model created from a Wikipedia dump. I turned this model into KeyedVectors and the code looks like this:
from gensim.models import KeyedVectors
def vocabulary():
    model = KeyedVectors.load("vectors.kv")
    result = model.index_to_key[:5]
    print(result)
The result is:
[(507, 1), (858, 1), (785, 1), (251, 1), (9807, 1)]
On the other hand, when I try the same with a model made from tokenized text, the result is:
['pdop', 'podany', 'wału', 'wytrzymałości', 'skręcanie']
Why am I getting numbers instead of words from the first model?
I used the .get_texts() function, but I can't split its result and pass it as the sentences value to initialize the model, so I tried creating the model with one short article and then training it with data from .get_texts() article by article, like this:
wiki_txt = wiki.get_texts()
model = Word2Vec.load('wiki_w2v.model')
for i in wiki_txt:
    article = list(i)
    model.train(article, total_examples=1, epochs=1)
Even though at this point article is a list[str], the model still can't learn from the data.
In a KeyedVectors from a Gensim Word2Vec model, the index_to_key property is usually a plain Python list, where each entry is a string word.
That yours isn't suggests something atypical happened during training. Perhaps the corpus that was fed to the Word2Vec model was made up of lists-of-tuples, rather than lists-of-string-tokens?
You should go back to the code which initially created & saved that vectors.kv file, with special attention to the corpus that was used for training. (What was its first element? Did it look like a usual text-of-words – a list of strings – or something else?)
Updates based on comments below:
It looks like you are providing a literal instance of WikiCorpus to Word2Vec as the sentences argument. That's not the sort of corpus – a sequence where each item is a list-of-words – that Word2Vec expects.
The iterator returned by wiki.get_texts() is closer – it's the right format, a sequence-of-lists-of-words – but it only passes over the dump once, whereas Word2Vec needs something it can re-iterate over for multiple passes.
I recommend one of these three options:
Writing your own wrapper iterable class, which will call wiki.get_texts() again each time a new iteration is requested (see the sketch after this list); or...
Writing some code to iterate through wiki.get_texts() once, writing each text into a giant plain-text file, with one article per line. Then, run Word2Vec using that file as input; or...
If you're on a giant-memory machine that can load the whole corpus as a list, you could do corpus_list = list(wiki.get_texts()), then pass that corpus_list as the training data to Word2Vec.
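For the first option, here is a minimal sketch of such a wrapper (the class name is just illustrative; it assumes wiki is your WikiCorpus):
class RepeatableWikiTexts:
    """Re-iterable wrapper: each call to iter() starts a fresh pass over the dump."""
    def __init__(self, wiki_corpus):
        self.wiki_corpus = wiki_corpus
    def __iter__(self):
        # get_texts() yields one list of word-strings per article
        return iter(self.wiki_corpus.get_texts())

# usage sketch:
# model = Word2Vec(sentences=RepeatableWikiTexts(wiki), workers=4)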
Separately:
If your only goal is Word2Vec training, you don't really need to do the long-running, expensive initial load of the WikiCorpus (wiki = WikiCorpus('plwiki-latest-pages-articles.xml.bz2')) and then re-save it. That does a time-consuming vocabulary survey of the dump that isn't even consulted by later Word2Vec training, which does its own survey.
Instead, when you need to use the dump, you can just do:
wiki = WikiCorpus('plwiki-latest-pages-articles.xml.bz2', dictionary=dict())
That leaves the wiki object ready for operations like .get_texts(), without wastefully creating a redundant dictionary.
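As a rough sketch combining this with the second option above – writing each article to a one-article-per-line text file once, then training from that file (the file name is just an example, and it assumes get_texts() yields lists of string tokens):
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# one-time pass over the dump: one space-separated article per line
with open('wiki_texts.txt', 'w', encoding='utf-8') as out:
    for article_tokens in wiki.get_texts():
        out.write(' '.join(article_tokens) + '\n')

# LineSentence can be iterated over as many times as Word2Vec needs
model = Word2Vec(sentences=LineSentence('wiki_texts.txt'), workers=4)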
I'm trying to classify a dataset of files. In this dataset I have a column of texts and a column of labels. I want to build a model, based on the Wikipedia corpus, but I'm a little lost in the middle.
What I did so far...
I preprocessed my Text column (removing stopwords, whitespace, accents, etc.) and saved it to a new CSV file. Then I tagged it using the gensim lib:
import gensim
import smart_open

def apply_preprocessing(fname, tokens_only=False):
    with smart_open.open(fname) as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])
preprocessed_dataset = '/content/drive/MyDrive/dataset/preprocessed_dataset.csv'
preprocessed_dataset_train = list(apply_preprocessing(preprocessed_dataset))
This gives me an array of arrays of the words contained in my Text column, for each document I have in my preprocessed_dataset.
I know that with this loop I get each array of words:
for doc_id in range(len(preprocessed_dataset_train)):
    preprocessed_dataset_train[doc_id].words
My goal is to give it those words and "say": based on these pretrained Wikipedia embeddings (https://wikipedia2vec.github.io/wikipedia2vec/pretrained/), how similar is one doc to another, given what you learned from this Wikipedia corpus?
How do I use this pretrained Wikipedia model? It is already a file of word vectors, right? If so, how can I use it to analyse my preprocessed_dataset_train?
What's the next step should I do/understand to get to my goal?
I'm sorry for so many questions; whenever I think I understand the road, I get lost again and again.
I have trained a linear regression model with sklearn for a 5-star rating and it's good enough. I used Doc2Vec to create my vectors and saved that model. Then I saved the linear regression model to another file. What I'm trying to do is load the Doc2Vec model and the linear regression model and try to predict another review.
There is something very strange about this prediction: whatever the input, it always predicts around 2.1-3.0.
The thing is, I suspected that it just predicts around the average of the 5-star scale (roughly 2.5), but that is not the case: when training the model I printed the predicted value and the actual value for the test data, and they range normally from 1 to 5. So my idea is that there is something wrong with the loading part of the code. This is my loading code:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from bs4 import BeautifulSoup
from joblib import dump, load
import pickle
import re
model = Doc2Vec.load('../vectors/750000/doc2vec_model')
def cleanText(text):
    text = BeautifulSoup(text, "lxml").text
    text = re.sub(r'\|\|\|', r' ', text)
    text = re.sub(r'http\S+', r'<URL>', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    text = text.replace('x', '')
    return text
review = cleanText("Horrible movie! I don't recommend it to anyone!").split()
vector = model.infer_vector(review)
pkl_filename = "../vectors/750000/linear_regression_model.joblib"
with open(pkl_filename, 'rb') as file:
    linreg = pickle.load(file)
review_vector = vector.reshape(1,-1)
predict_star = linreg.predict(review_vector)
print(predict_star)
Your example code shows imports of both joblib.dump and joblib.load – even though neither is used in this excerpt. And, the suffix of your file suggests that the model may have originally been saved with joblib.dump(), not vanilla pickle.
But, this code shows the file being loaded only via plain pickle.load() – which may be the source of the error.
The joblib.load() docs suggest that its load() may do things like load numpy arrays from multiple separate files created by its own dump(). (Oddly, the dump() docs are less clear on this, but supposedly dump() has a return-value that may be a list of filenames.)
You can check where the file was saved for extra files that appear to be related, and try using joblib.load() rather than plain-pickle, to see if that loads a more-functional/more-complete version of your linreg object.
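For instance, a quick check (just a sketch, assuming the file really was written by joblib.dump()):
from joblib import load

# if the file was written by joblib.dump(), this should restore the full model,
# including any companion .npy files that dump() may have written alongside it
linreg = load("../vectors/750000/linear_regression_model.joblib")
print(type(linreg))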
(Update: I overlooked the .split() tokenization being done in the question code after .cleanText(), so this isn't the real problem. But keeping answer up for reference & because the real issue was discovered in the comments.)
Very commonly, users get mysteriously-weak results from Doc2Vec when they provide a plain string to infer_vector(). Doc2Vec infer_vector() requires a list-of-words, not a string.
If providing a string, the function will see it as a list-of-one-character words – per Python's modeling of strings as lists-of-characters, and type-conflation of characters and one-character-strings. Most of these one-character words probably aren't known by the model, and those that might be – 'i', 'a', etc – aren't very meaningful. So the inferred doc-vector will be weak & meaningless. (And, it isn't surprising such a vector, fed to your linear regression, always gives a middling predicted value.)
If you break the text into the expected list-of-words, your results should improve.
But more generally, the words provided to infer_vector() should be preprocessed and tokenized exactly however the training documents were.
(A fair sanity test of whether you're doing inference properly is to infer vectors for some of your training documents, then ask the Doc2Vec model for the doc-tags closest to these re-inferred vectors. In general, the same document's training-time tag/ID should be the top result, or at least one of the top few. If it isn't, there may be other problems in the data, model parameters, or inference.)
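A rough sketch of that sanity test, using the older gensim API that appears in the question (docvecs, infer_vector); train_docs is an assumed name for your list of TaggedDocument objects:
# re-infer a few training documents and check whether each doc's own tag
# ranks near the top of the most-similar trained doc-vectors
for doc in train_docs[:10]:
    inferred = model.infer_vector(doc.words)
    top = model.docvecs.most_similar([inferred], topn=3)
    print(doc.tags[0], top)  # expect doc.tags[0] among the top results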
I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors via the gensim library. But my corpus contains some words that are not in the GoogleNews vocabulary. For these missing words, I want to use the arithmetic mean of the n most similar GoogleNews words. First I load GoogleNews and check whether the word "to" is in it:
#Load GoogleNews pretrained word2vec model
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(model["to"])
I receive an error: KeyError: "word 'to' not in vocabulary"
Is it possible that such a large dataset doesn't have this word? This is also true for some other common words like "a"!
To add missing words to the word2vec model, first I want to get the indices of the words that are in GoogleNews. For missing words I have used index 0.
#obtain index of words
word_to_idx=OrderedDict({w:0 for w in corpus_words})
word_to_idx=OrderedDict({w:model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})
Then I calculate the mean of the embedding vectors of the most similar words for each missing word.
missing_embd={}
for key, value in word_to_idx.items():
    if value == 0:
        similar_words = model.wv.most_similar(key)
        similar_embeddings = [model.wv[a[0]] for a in similar_words]
        missing_embd[key] = mean(similar_embeddings)
And then I add these new embeddings to the word2vec model by:
for word, embd in missing_embd.items():
    # model.wv.build_vocab(word, update=True)
    model.wv.syn0[model.wv.vocab[word].index] = embd
There is an inconsistency: when I print missing_embd, it's empty, as if there were no missing words.
But when I check it by this:
for w in tokens_lower:
    if (w in model.wv.vocab) == False:
        print(w)
print("***********")
I found a lot of missing words.
Now, I have 3 questions:
1- Why is missing_embd empty while there are some missing words?
2- Is it possible that GoogleNews doesn't have words like "to"?
3- How can I append new embeddings to the word2vec model? I used build_vocab and syn0. Thanks.
Here is a scenario where we are adding a missing lower case word.
from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)
'Quoran' in embedding.vocab
Output : True
'quoran' in embedding.vocab
Output : False
Here Quoran is present but quoran in lower case is missing
# add quoran in lower case
embedding.add('quoran',embedding.get_vector('Quoran'),replace=False)
'quoran' in embedding.vocab
Output : True
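(As an aside, in more recent gensim releases – 4.0 and later, if I recall correctly – the .vocab dict and .add() method shown above were replaced; the rough equivalents look something like this, but check the docs of your installed version:)
# gensim >= 4.0 style (approximate)
'Quoran' in embedding.key_to_index                              # membership check replaces .vocab
embedding.add_vector('quoran', embedding.get_vector('Quoran'))  # replaces .add()
'quoran' in embedding.key_to_index                              # now True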
It's possible Google removed common filler words like 'to' and 'a'. If the file seems otherwise uncorrupt, and checking other words after load() shows that they are present, it'd be reasonable to assume Google discarded the overly-common words as having such diffuse meaning as to be of low-value.
It's unclear and muddled what you're trying to do. You assign to word_to_idx twice - so only the second line matters.
(The first assignment, creating a dict where all words have a 0 value, has no lingering effect after the 2nd line creates an all-new dict, with only entries where w in model.wv.vocab. The only possible entry with a 0 after this step would be whatever word in the word-vectors set was already in position 0 – if and only if that word was also in your corpus_words.)
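For illustration, one way to keep track of which corpus words are missing in a single pass (a sketch reusing the question's names and older-gensim attributes):
from collections import OrderedDict

# None marks a word with no GoogleNews vector; known words keep their real index
word_to_idx = OrderedDict(
    (w, model.wv.vocab[w].index if w in model.wv.vocab else None)
    for w in corpus_words
)
missing_words = [w for w, idx in word_to_idx.items() if idx is None]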
You seem to want to build new vectors for unknown words based on an average of similar words. However, the most_similar() only works for known-words. It will error if tried on a completely unknown word. So that approach can't work.
And a deeper problem is the gensim KeyedVectors class doesn't have support for dynamically adding new word->vector entries. You would have to dig into its source code and, to add one or a batch of new vectors, modify a bunch of its internal properties (including its vectors array, vocab dict, and index2entity list) in a self-consistent manner to have new entries.
When I train Doc2Vec (using Gensim's Doc2Vec in Python) on a corpus of about 10k documents (each has a few hundred words) and then infer document vectors using the same documents, they are not at all similar to the trained document vectors. I would expect them to be at least somewhat similar.
That is, I do model.docvecs['some_doc_id'] and model.infer_vector(documents['some_doc_id']).
Cosine distances between trained and inferred vectors for few first documents:
0.38277733326
0.284007549286
0.286488652229
0.173178792
0.370117008686
0.275438070297
0.377647638321
0.171194493771
0.350615143776
0.311795353889
0.342757165432
As you can see, they are not really similar. If the similarity is so terrible even for documents used for training, I can't even begin to try to infer unseen documents.
Training configuration:
model = Doc2Vec(documents=documents, dm=1, size=100, window=6, alpha=0.1, workers=4,
                seed=44, sample=1e-5, iter=15, hs=0, negative=8, dm_mean=1, min_alpha=0.01, min_count=2)
Inferring:
model.infer_vector(tokens, steps=20, alpha=0.025)
Note on the side: Documents are always preprocessed the same way (I checked that the same list of tokens goes into training and into inferring).
Also, I played around with the parameters a bit, and the results were similar. So if your suggestion is something like "try increasing or decreasing this or that training parameter", I've most likely tried it. Maybe I just didn't come across the 'correct' parameters, though.
Thanks for any suggestions as to what I can do to make it work better.
EDIT: I am willing and able to use any other available Python implementation of paragraph vectors (doc2vec). It doesn't have to be this one, if you know of another that can achieve better results.
EDIT: Minimal working example
import fnmatch
import os
from scipy.spatial.distance import cosine
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from keras.preprocessing.text import text_to_word_sequence
files = {}
folder = 'some path'  # each file contains a few regular sentences
for f in fnmatch.filter(os.listdir(folder), '*.sent'):
    files[f] = open(folder + '/' + f, 'r', encoding="UTF-8").read()

documents = []
for k, v in files.items():
    words = text_to_word_sequence(v, lower=True)  # converts string to list of words, removes commas etc.
    documents.append(TaggedDocument(tags=[k], words=words))

d2 = Doc2Vec(size=200, documents=documents)

for doc in documents:
    trained = d2.docvecs[doc.tags[0]]
    inferred = d2.infer_vector(doc.words, steps=50)
    print(cosine(trained, inferred))  # cosine distance from scipy
What is the type of your documents object, and are you sure that it is a multiply-iterable object, so that the model can do all of its 16 passes over the set of TaggedDocument-shaped text examples? That is, does iter(documents) always return a fresh iterator, with all items as TaggedDocument-shaped objects with the right list-of-words in words and list-of-tags in tags? (A common error is to supply a corpus that can be iterated over only once, and then ignore any logged hints/warnings that no real training has happened. The inference/similarity results from such a model will be essentially random.)
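A quick way to check both points (just a sketch; documents here is whatever object you passed as the corpus):
from gensim.models.doc2vec import TaggedDocument

first_item = next(iter(documents))  # should be a TaggedDocument with a list-of-words and tags
print(type(first_item) is TaggedDocument, first_item.words[:5], first_item.tags)

# a genuinely re-iterable corpus gives the same non-zero count on every pass;
# a single-use iterator would give 0 on the second pass
print(sum(1 for _ in documents), sum(1 for _ in documents))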
Then for infer_vector(), does documents[tag] really return just the list-of-words it expects (not TaggedDocument or string)? (Users often supply strings, rather than lists-of-tokens, for training or inference words and get results that are just noise.)
Was there an evaluation-guided reason for changing various defaults, either a little (window=6, negative=8) or a lot (alpha=0.1, min_count=2)? Such tweaks may not be a major factor in your problem, and there's nothing magical about the class defaults. But until you have the basics working, it's best to stick close to common configuration. (And then even after the basics are working, limit changes to those that can be demonstrated as better via a repeatable scoring process.)
Some report needing much higher steps values – 100 or more – to get better inference results, though that would be most crucial for very-small documents (of a handful to couple dozen words) rather than the few-hundred-words documents you describe.
A corpus of 10k documents is on the small side for Paragraph Vectors (Doc2Vec), but with your smallish vector-size (100) and larger number of iterations (15), it might be workable.
If you're still having problems, you should expand your question with more code showing how documents works, some suggestive example documents, and your cosine-similarity evaluation process – to see if there are any oversights at each of those steps.