Gensim - index_to_key gives numbers instead of words as output - python

I'm trying to get the 5 most frequent words from a word2vec model created from a Wikipedia dump. I turned this model into KeyedVectors, and the code looks like this:
from gensim.models import KeyedVectors

def vocabulary():
    model = KeyedVectors.load("vectors.kv")
    result = model.index_to_key[:5]
    print(result)
The result is:
[(507, 1), (858, 1), (785, 1), (251, 1), (9807, 1)]
On the other hand, when I try the same with a model made from tokenized text, the result is:
['pdop', 'podany', 'wału', 'wytrzymałości', 'skręcanie']
Why am I getting numbers instead of words from the first model?
I used the .get_texts() function, but I can't split the result of this function and pass it as the sentences value to initiate the model. So I tried creating a model with one short article, and then training it with data from .get_texts() article by article, like this:
wiki_txt = wiki.get_texts()
model = Word2Vec.load('wiki_w2v.model')
for i in wiki_txt:
    article = list(i)
    model.train(article, total_examples=1, epochs=1)
Even though at this point article is a list[str], the model still can't learn from the data.

In a KeyedVectors from a Gensim Word2Vec model, the index_to_key property is usually a plain Python list, where each entry is a string word.
The fact that yours isn't suggests something atypical happened during training. Perhaps the corpus that was fed to the Word2Vec model was made up of lists-of-tuples, rather than lists-of-string-tokens?
You should go back to the code which initially created & saved that vectors.kv file, with special attention to the corpus that was used for training. (What was its first element? Did it look like a usual text-of-words – a list of strings – or something else?)
Updates based on comments below:
It looks like you are providing a literal instance of WikiCorpus to Word2Vec as the sentences argument. That's not the sort of corpus – a sequence where each item is a list-of-words – that Word2Vec expects.
The iterator returned by wiki.get_texts() is closer – it's the right format, a sequence-of-lists-of-words – but it only passes over the dump once, whereas Word2Vec needs something it can re-iterate over for multiple passes.
I recommend one of these three options:
Writing your own wrapper iterable class, which will call wiki.get_texts() again each time a new iteration is requested (see the sketch just below this list); or...
Writing some code to iterate through wiki.get_texts() once, writing each text into a giant plain-text file, with one article per line. Then, run Word2Vec using that file as input; or...
If you're on a giant-memory machine that can load the whole corpus as a list, you could do corpus_list = list(wiki.get_texts()), then pass that corpus_list as the training data to Word2Vec.
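For the first option, here's a minimal sketch of such a wrapper, assuming the standard WikiCorpus API (the class name RestartableWikiTexts is just illustrative):

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec

class RestartableWikiTexts:
    def __init__(self, wiki):
        self.wiki = wiki
    def __iter__(self):
        # each call to get_texts() starts a fresh pass over the dump
        return iter(self.wiki.get_texts())

wiki = WikiCorpus('plwiki-latest-pages-articles.xml.bz2', dictionary=dict())
model = Word2Vec(sentences=RestartableWikiTexts(wiki))

Because __iter__() restarts the extraction, Word2Vec can make its vocabulary-survey pass and all its training passes over the same object.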
Separately:
If your only goal is doing a Word2Vec training, you don't really need to do the long-running, expensive initial load of the WikiCorpus (`wiki = WikiCorpus('plwiki-latest-pages-articles.xml.bz2')`) then re-save. That does a time-consuming vocabulary survey of the dump that isn't even consulted by later word2vec training, which does its own survey.
Instead, when you need to use the dump, you can just do:
wiki = WikiCorpus('plwiki-latest-pages-articles.xml.bz2', dictionary=dict())
That leaves the wiki object ready for operations like .get_texts(), without wastefully creating a redundant dictionary.

Related

Is there a way to stop creation of vocabulary in gensim.WikiCorpus when it reaches 2000000 tokens?

I downloaded the latest multi-stream bz2 wiki dump. I call the WikiCorpus class from gensim.corpora, and after 90000 documents the vocabulary reaches its highest value (2000000 tokens).
I got this in terminal:
keeping 2000000 tokens which were in no less than 0 and no more than 580000 (=100.0%) documents
resulting dictionary: Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
adding document #580000 to Dictionary(2000000 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
The WikiCorpus class continues to work until the end of the documents in my bz2.
Is there a way to stop it? Or to split the bz2 file into a sample?
Thanks for the help!
There's no specific parameter to cap the number of tokens. But when you use WikiCorpus.get_texts(), you don't have to read them all: you can stop at any time.
If, as suggested by another question of yours, you plan to use the article texts for Gensim Word2Vec (or a similar model), you don't need the constructor to do its own expensive full-scan vocabulary-discovery. If you supply any dummy object (such as an empty dict) as the optional dictionary parameter, it'll skip this unnecessary step. For example:
wiki_corpus = WikiCorpus(filename, dictionary={})
If you also want to use some truncated version of the full set of articles, I'd suggest manually iterating over just a subset of the articles. For example, if the subset will easily fit as a list in RAM – say, 50000 articles – that's as simple as:
import itertools
subset_corpus = list(itertools.islice(wiki_corpus.get_texts(), 50000))
If you want to create a subset larger than RAM, iterate over the set number of articles, writing their tokenized texts to a scratch text file, one per line. Then use that file as your later input. (By spending the WikiCorpus extraction/tokenization effort only once, then reusing the file from disk, this can sometimes offer a performance boost even when it isn't strictly necessary.)
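A minimal sketch of that larger-than-RAM approach, assuming a gensim version where get_texts() yields lists of token strings (the 50000 cap and the filename are illustrative):

import itertools

with open('wiki_subset.txt', 'w', encoding='utf-8') as fout:
    # write each tokenized article as one space-delimited line
    for tokens in itertools.islice(wiki_corpus.get_texts(), 50000):
        fout.write(' '.join(tokens) + '\n')

That file is then usable with helpers like gensim.models.word2vec.LineSentence as re-iterable training data.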

BucketIterator not returning batches of correct size

I'm implementing a simple LSTM language model in PyTorch, and wanted to check out the BucketIterator that is provided by torchtext.
It turns out that the batch that is returned has the size of my entire corpus, so I must be doing something wrong during its initialisation.
I've already got the BPTTIterator working, but as I want to be able to train on batches of complete sentences as well, I thought the BucketIterator should be the way to go.
I use the following setup, with my corpus being a simple txt file containing one sentence per line.
from torchtext.data import Field, BucketIterator
from torchtext.datasets import PennTreebank

field = Field(use_vocab=True, batch_first=True)
corpus = PennTreebank('project_2_data/train_lines.txt', field)
field.build_vocab(corpus)

iterator = BucketIterator(corpus,
                          batch_size=64,
                          repeat=False,
                          sort_key=lambda x: len(x.text),
                          sort_within_batch=True)
I expect a batch from this iterator to have the shape (batch_size, max_len), but it appends the entire corpus into 1 tensor of shape (1, corpus_size).
What am I missing in my setup?
Edit: it seems the PennTreebank object is not compatible with a BucketIterator (it contains only 1 Example, as noted here: http://mlexplained.com/2018/02/15/language-modeling-tutorial-in-torchtext-practical-torchtext-part-2/). Using a TabularDataset with only 1 Field got it working.
If someone has an idea of how language modelling with padded sentence batches can be done in torchtext in a more elegant manner, I'd love to hear it!
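For reference, a hedged sketch of that TabularDataset workaround, using the legacy torchtext.data API current at the time (the file path is reused from the question; the details of the working fix weren't posted, so this is an assumption):

from torchtext.data import Field, TabularDataset, BucketIterator

field = Field(use_vocab=True, batch_first=True)
# one sentence per line; with 'tsv' format and a single field,
# each line becomes its own Example with a .text attribute
dataset = TabularDataset('project_2_data/train_lines.txt', format='tsv',
                         fields=[('text', field)])
field.build_vocab(dataset)
iterator = BucketIterator(dataset, batch_size=64, repeat=False,
                          sort_key=lambda x: len(x.text),
                          sort_within_batch=True)
# each batch.text should now be (batch_size, max_len), padded per batch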

Linear regression load model doesn't predict as expected

I have trained a linear regression model with sklearn for a 5-star rating, and it's good enough. I used Doc2vec to create my vectors and saved that model. Then I saved the linear regression model to another file. What I'm trying to do is load the Doc2vec model and the linear regression model and predict another review.
There is something very strange about this prediction: whatever the input, it always predicts around 2.1-3.0.
My first thought was that it's just predicting around the average of the 1-5 scale (about 2.5), but this is not the case: when training the model I printed the prediction value and the actual value for the test data, and they normally range over 1-5. So my idea is that there is something wrong with the loading part of the code. This is my load code:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from bs4 import BeautifulSoup
from joblib import dump, load
import pickle
import re

model = Doc2Vec.load('../vectors/750000/doc2vec_model')

def cleanText(text):
    text = BeautifulSoup(text, "lxml").text
    text = re.sub(r'\|\|\|', r' ', text)
    text = re.sub(r'http\S+', r'<URL>', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    text = text.replace('x', '')
    return text

review = cleanText("Horrible movie! I don't recommend it to anyone!").split()
vector = model.infer_vector(review)

pkl_filename = "../vectors/750000/linear_regression_model.joblib"
with open(pkl_filename, 'rb') as file:
    linreg = pickle.load(file)

review_vector = vector.reshape(1, -1)
predict_star = linreg.predict(review_vector)
print(predict_star)
Your example code shows imports of both joblib.dump and joblib.load – even though neither is used in this excerpt. And the suffix of your file suggests the model may have originally been saved with joblib.dump(), not vanilla pickle.
But this code shows the file being loaded only via plain pickle.load() – which may be the source of the error.
The joblib.load() docs suggest that its load() may do things like load numpy arrays from multiple separate files created by its own dump(). (Oddly, the dump() docs are less clear on this, but supposedly dump() has a return-value that may be a list of filenames.)
You can check where the file was saved for extra files that appear to be related, and try using joblib.load() rather than plain pickle, to see if that loads a more-functional/more-complete version of your linreg object.
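That is, a minimal change worth trying, assuming the file was indeed written by joblib.dump():

from joblib import load

# joblib.load() knows how to reassemble any companion files its dump() created
linreg = load("../vectors/750000/linear_regression_model.joblib")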
(Update: I overlooked the .split() tokenization being done in the question code after cleanText(), so this isn't the real problem. But I'm keeping the answer up for reference, and because the real issue was discovered in the comments.)
Very commonly, users get mysteriously-weak results from Doc2Vec when they provide a plain string to infer_vector(). Doc2Vec infer_vector() requires a list-of-words, not a string.
If providing a string, the function will see it as a list of one-character words – per Python's modeling of strings as sequences of characters, and its conflation of characters with one-character strings. Most of these one-character words probably aren't known by the model, and those that might be – 'i', 'a', etc. – aren't very meaningful. So the inferred doc-vector will be weak and meaningless. (And it isn't surprising that such a vector, fed to your linear regression, always gives a middling predicted value.)
If you break the text into the expected list-of-words, your results should improve.
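To illustrate the contrast, using the question's own cleanText() and model:

review = cleanText("Horrible movie! I don't recommend it to anyone!")
vec_from_string = model.infer_vector(review)           # seen as ['h', 'o', 'r', 'r', ...] – noise
vec_from_words = model.infer_vector(review.split())    # seen as ['horrible', 'movie', ...] – meaningful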
But more generally, the words provided to infer_vector() should be preprocessed and tokenized exactly however the training documents were.
(A fair sanity test of whether you're doing inference properly is to infer vectors for some of your training documents, then ask the Doc2Vec model for the doc-tags closest to these re-inferred vectors. In general, the same document's training-time tag/ID should be the top result, or at least one of the top few. If it isn't, there may be other problems in the data, model parameters, or inference.)
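A sketch of that sanity test, assuming train_docs holds the TaggedDocument objects used in training (the name is illustrative):

doc = train_docs[0]
reinferred = model.infer_vector(doc.words)
# the doc's own training-time tag should appear at or near the top
print(model.docvecs.most_similar([reinferred], topn=3))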

Gensim's Doc2vec - inferred vector isn't similar

When I train Doc2vec (using Gensim's Doc2vec in Python) on a corpus of about 10k documents (each has a few hundred words) and then infer document vectors using the same documents, they are not at all similar to the trained document vectors. I would expect them to be at least somewhat similar.
That is, I do model.docvecs['some_doc_id'] and model.infer_vector(documents['some_doc_id']).
Cosine distances between trained and inferred vectors for the first few documents:
0.38277733326
0.284007549286
0.286488652229
0.173178792
0.370117008686
0.275438070297
0.377647638321
0.171194493771
0.350615143776
0.311795353889
0.342757165432
As you can see, they are not really similar. If the similarity is so terrible even for documents used for training, I can't even begin to try to infer unseen documents.
Training configuration:
model = Doc2Vec(documents=documents, dm=1, size=100, window=6, alpha=0.1,
                workers=4, seed=44, sample=1e-5, iter=15, hs=0, negative=8,
                dm_mean=1, min_alpha=0.01, min_count=2)
Inferring:
model.infer_vector(tokens, steps=20, alpha=0.025)
Note on the side: Documents are always preprocessed the same way (I checked that the same list of tokens goes into training and into inferring).
Also, I played around with the parameters a bit, and the results were similar. So if your suggestion would be something like "try increasing or decreasing this or that training parameter", I've most likely tried it. Maybe I just didn't come across the 'correct' parameters though.
Thanks for any suggestions as to what I can do to make it work better.
EDIT: I am willing and able to use any other available Python implementation of paragraph vectors (doc2vec). It doesn't have to be this one, if you know of another that can achieve better results.
EDIT: Minimal working example
import fnmatch
import os
from scipy.spatial.distance import cosine
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from keras.preprocessing.text import text_to_word_sequence

files = {}
folder = 'some path'  # each file contains a few regular sentences
for f in fnmatch.filter(os.listdir(folder), '*.sent'):
    files[f] = open(folder + '/' + f, 'r', encoding="UTF-8").read()

documents = []
for k, v in files.items():
    words = text_to_word_sequence(v, lower=True)  # converts string to list of words, removes commas etc.
    documents.append(TaggedDocument(tags=[k], words=words))

d2 = Doc2Vec(size=200, documents=documents)

for doc in documents:
    trained = d2.docvecs[doc.tags[0]]
    inferred = d2.infer_vector(doc.words, steps=50)
    print(cosine(trained, inferred))  # cosine distance (1 - similarity) from scipy
What is the type of your documents object, and are you sure that it is a multiply-iterable object, so that the model can do all of its 16 passes over the set of TaggedDocument-shaped text examples? That is, does iter(documents) always return a fresh iterator, with all items as TaggedDocument-shaped objects with the right list-of-words in words and list-of-tags in tags? (A common error is to supply a corpus that can be iterated over only once, and then to ignore any logged hints/warnings that no real training has happened. The inference/similarity results from such a model will be essentially random.)
Then, for infer_vector(), does documents[tag] really return just the list-of-words it expects (not a TaggedDocument or a string)? (Users often supply strings, rather than lists-of-tokens, for training or inference words, and get results that are just noise.)
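A quick sketch covering both checks, with illustrative prints (assumes the documents object from your code):

first = next(iter(documents))
print(type(first))       # should be TaggedDocument
print(first.words[:5])   # words should be a list of token strings
print(first.tags)        # tags should be a list of tags/ids
# a one-shot iterator yields 0 items on a second pass; both counts should match
print(sum(1 for _ in documents), sum(1 for _ in documents))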
Was there an evaluation-guided reason for changing various defaults, either a little (window=6, negative=8) or a lot (alpha=0.1, min_count=2)? Such tweaks may not be a major factor in your problem, and there's nothing magical about the class defaults. But until you have the basics working, it's best to stick close to common configuration. (And then even after the basics are working, limit changes to those that can be demonstrated as better via a repeatable scoring process.)
Some report needing much higher steps values – 100 or more – to get better inference results, though that would be most crucial for very-small documents (of a handful to a couple dozen words) rather than the few-hundred-word documents you describe.
A corpus of 10k documents is on the small side for Paragraph Vectors (Doc2Vec), but with your smallish vector-size (100) and larger number of iterations (15), it might be workable.
If you're still having problems, you should expand your question with more code showing how documents works, some suggestive example documents, and your cosine-similarity evaluation process – to see if there are any oversights at each of those steps.

Sentence matching with gensim word2vec: manually populated model doesn't work

I'm trying to solve a problem of sentence comparison using the naive approach of summing up word vectors and comparing the results. My goal is to match people by interest, so the dataset consists of names and short sentences describing their hobbies. The batches are fairly small, a few hundred people, so I wanted to give it a try before digging into doc2vec.
I prepare the data by cleaning it completely, removing stop words, tokenizing and lemmatizing. I use a pre-trained model for word vectors, which returns adequate results when finding similarities for some test words. I also tried summing up the sentence words to find similarities in the original model – the matches do make sense, capturing the general sense of the phrase.
For sentence matching I'm trying the following. Create an empty model:
import gensim as gs

b = gs.models.Word2Vec(min_count=1, size=300, sample=0, hs=0)
Build vocab out of names (or person ids), with no training:
# first create vocab with an empty vector
test = [['test']]
b.build_vocab(test)
b.wv.syn0[b.wv.vocab['test'].index] = b.wv.syn0[b.wv.vocab['test'].index] * 0

# populate vocab from an array
b.build_vocab([personIds], update=True)
Sum each sentence's word vectors and store the results in the model for the corresponding id:
# sentences are pulled from pandas dataset df. 'a' is a pre-trained model I use to get vectors for each word
def summ(phrase, start_model):
    '''
    vector addition function
    '''
    # starting with a vector of 0's
    sum_vec = start_model.word_vec("cat_NOUN") * 0
    for word in phrase:
        sum_vec += start_model.word_vec(word)
    return sum_vec

for i, row in df.iterrows():
    try:
        personId = row["ID"]
        summVec = summ(df.iloc[i, 1], a)
        # updating syn0 for each name/id in vocabulary
        b.wv.syn0[b.wv.vocab[personId].index] = summVec
    except:
        pass
I understand that I shouldn't be expecting much accuracy here, but the t-SNE plot doesn't show any clustering whatsoever. Finding similarities also fails to find matches (a similarity coefficient below 0.2 for basically everything).
Wondering if anyone has an idea of where I went wrong? Is my approach valid at all?
Your code, as shown, neither does any train() of word-vectors (using your local text), nor pre-loads any vectors from elsewhere. So any vectors which do exist – created by the build_vocab() calls – will still just be at their randomly-initialized starting locations, and be useless for any semantic purposes.
Suggestions:
either (a) train your own vectors from your text, which makes sense if you have a good quantity of text; or (b) load vectors from elsewhere. But don't try to do both. (Or, in the case of the code above, neither.)
The update=True option for build_vocab() should be considered an expert, experimental option – only worth tinkering with if you've already had things working in simpler modes, and you're sure you need it and understand all the implications.
Normal use won't ever explicitly re-assign new values into the Word2Vec model's syn0 property – those are managed by the class's training routines, so you never need to zero them out or modify them. You should tally up your own text summary vectors, based on word-vectors, outside the model in your own data structures (see the sketch below).
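A minimal sketch of that last suggestion, assuming a is the pre-trained KeyedVectors-style model from your code (with gensim-3.x-era attributes like .vocab and .word_vec()); names like person_vecs are illustrative:

import numpy as np

# build a {person_id: summed-vector} mapping outside any Word2Vec model
person_vecs = {}
for i, row in df.iterrows():
    words = [w for w in df.iloc[i, 1] if w in a.vocab]
    if words:
        person_vecs[row["ID"]] = np.sum([a.word_vec(w) for w in words], axis=0)

def cos_sim(v1, v2):
    # plain cosine similarity between two summed sentence vectors
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

You can then rank candidate matches by cos_sim() over person_vecs directly, with no dummy model or syn0 surgery involved.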
