Gensim Word2Vec distances are too close - python

I am training my own word2vec model on Gensim in python, on a relatively small dataset. The data consist of about 3000 short-text entries from different people, most of which are two or three sentences. I know this is small for a word2vec dataset, but I've seen similar ones work in the past.
For some reason, when I train my model all of the features are impractically close to one another. For instance:
model.most_similar('jesus/NN')
[(u'person/NN', 0.9999418258666992),
 (u'used/VB', 0.9998890161514282),
 (u'so/RB', 0.9998359680175781),
 (u'question/NN', 0.9997845888137817),
 (u'just/RB', 0.9996646642684937),
 (u'other/NN', 0.9995589256286621),
 (u'allow/VB', 0.9995476603507996),
 (u'feel/VB', 0.9995381236076355),
 (u'attend/VB', 0.9995047450065613),
 (u'make/VB', 0.9994802474975586)]
The parts of speech are included because I lemmatize the data.
Here is my training code:
cleanedResponses = []
# For every response
for rawValue in df['QB17_W6'].values:
    lemValue = lemmatize(rawValue)
    cleanedResponses.append(lemValue)
df['cleaned_responses'] = cleanedResponses

bigram_transformer = Phrases(df.cleaned_responses.values)
model = Word2Vec(bigram_transformer[df.cleaned_responses.values], size=5)
This also happens when I train without the bigram transformer. Does anybody have an idea as to why the distances are so close?


Doc2Vec results not as expected

I'm evaluating Doc2Vec for a recommender API. I wasn't able to find a decent pre-trained model, so I trained a model on the corpus, which is about 8,000 small documents.
model = Doc2Vec(vector_size=25,
                alpha=0.025,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
I then looped through the corpus to find similar documents for each document. Results were not very good (in comparison to TF-IDF). Note this is after testing different epochs and vector sizes.
inferred_vector = model.infer_vector(row['cleaned'].split())
sims = model.docvecs.most_similar([inferred_vector], topn=4)
I also tried extracting the trained vectors and using cosine_similarity, but the results were oddly even worse.
cosine_similarities = cosine_similarity(model.docvecs.vectors_docs, model.docvecs.vectors_docs)
Am I doing something wrong or is the smallish corpus the problem?
Edit:
Prep and training code
def unesc(s):
    for idx, row in s.iteritems():
        s[idx] = html.unescape(row)
    return s

custom_pipeline = [
    preprocessing.lowercase,
    unesc,
    preprocessing.remove_urls,
    preprocessing.remove_html_tags,
    preprocessing.remove_diacritics,
    preprocessing.remove_digits,
    preprocessing.remove_punctuation,
    lambda s: hero.remove_stopwords(s, stopwords=custom_stopwords),
    preprocessing.remove_whitespace,
    preprocessing.tokenize
]
ds['cleaned'] = ds['body'].pipe(hero.clean, pipeline=custom_pipeline)
w2v_total_data = list(ds['cleaned'])
tag_data = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(w2v_total_data)]
model = Doc2Vec(vector_size=25,
                alpha=0.025,
                min_alpha=0.00025,
                min_count=1,
                epochs=20,
                dm=1)
model.build_vocab(tag_data)
model.train(tag_data, total_examples=model.corpus_count, epochs=model.epochs)
model.save("lmdocs_d2v.model")
Without seeing your training code, there could easily be errors in text prep & training. Many online code examples are bonkers wrong in their Doc2Vec training technique!
Note that min_count=1 is essentially always a bad idea with this sort of algorithm: any example suggesting that was likely from a misguided author.
Is a mere .split() also the only tokenization applied for training? (The inference list-of-tokens should be prepped the same as the training lists-of-tokens.)
How was "not very good" and "oddly even worse" evaluated? For example, did the results seem arbitrary, or in-the-right-direction-but-just-weak?
"8,000 small documents" is a bit on the thin side for a training corpus, but it somewhat depends on "how small" – a few words, a sentence, a few sentences? Moving to smaller vectors, or more training epochs, can sometimes make the best of a smallish training set - but this sort of algorithm works best with lots of data, such that dense 100d-or-more vectors can be trained.

How to generate independent(X) variable using Word2vec?

I have a movie review dataset which has two columns: Review (sentences) and Sentiment (1 or 0).
I want to create a classification model using word2vec for the embedding and a CNN for the classification.
I've looked for tutorials on YouTube, but all they do is create vectors for every word and show me the similar words, like this:
model = gensim.models.Word2Vec(cleaned_dataset, min_count=2, size=100, window=5)
words = model.wv.vocab
similar = model.wv.most_similar("bad")
I already have my dependent variable (y), which is my 'Sentiment' column; all I need is the independent variable (X) which I can pass on to my CNN model.
Before using word2vec I used the Bag of Words (BOW) model, which generated a sparse matrix that served as my independent (X) variable. How can I achieve something similar using word2vec?
Kindly correct me if I'm doing something wrong.
To get the word vector, you have to do this:
model['word_that_you_want']
You may also want to handle the KeyError that could arise if you don't find that given word in your model. You also might want to read about what an embedding layer is, which is usually used as the first layer of the neural network (for NLP generally) and is basically a lookup mapping of a word to its corresponding word vector.
To get the word vectors for an entire sentence, you need to first initialize a numpy array of zeros to the dimensions you want.
You might need other variables such as the length of the longest sentence so that you can pad all sentences to that length. The documentation of the pad_sequences method for Keras is here.
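For instance, a tiny illustration of padding (the sequences here are made-up word-index lists):

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[4, 10, 2], [7, 3], [9]]  # toy word-index sequences
padded = pad_sequences(sequences, maxlen=3)  # zero-pads shorter sequences
# array([[ 4, 10,  2],
#        [ 0,  7,  3],
#        [ 0,  0,  9]], dtype=int32)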
A simple example of initializing that array is:
import numpy as np
embedding_matrix = np.zeros((vocab_len, size_of_your_word_vector))
Then iterate over your vocabulary and fill in embedding_matrix row by row, whenever you find a word vector in your model.
I use this resource which has a lot of examples and I have referenced some of the code there (which I have also used myself sometimes):
embedding_matrix = np.zeros((vocab_length, 100))
for word, index in word_tokenizer.word_index.items():
    try:
        embedding_vector = model[word]  # using your w2v model
        embedding_matrix[index] = embedding_vector
    except KeyError:
        pass  # word not in the model; its row stays all zeros
And in your model (I'm assuming TensorFlow with Keras):
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(vocab_length, 100, weights=[embedding_matrix],
                            input_length=length_long_sentence, trainable=False)
I hope this helps.
Word2Vec doesn't inherently create vectors for a text (set of words) – just individual words.
But, sometimes a not-so-bad vector for a multi-word text is the average of all its word-vectors.
If list_of_words is a list of the words in your text, and all the words are in the Word2Vec model, a simple way to get the average of those words' vectors is:
avg_vector_of_words = model.wv[list_of_words].mean(axis=0)
(If some words aren't present, you'd need to filter them before attempting this to avoid KeyErrors. If you wanted to leave out some words, or use unit-normed word-vectors, or unit-normalize the final vector, you'd need more code.)
Then avg_vector_of_words is a small, dense/'embedded' feature vector for the list_of_words text.
You could pass these vectors, one per text, to another downstream classifier, like your CNN, exactly analogously to how you were previously using sparse BOW vectors.
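For example, a minimal sketch of that filtering step (variable names follow the answer above):

# keep only words the model actually knows, to avoid KeyErrors
known_words = [w for w in list_of_words if w in model.wv]
if known_words:
    avg_vector_of_words = model.wv[known_words].mean(axis=0)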

How to merge two Word2Vec files

I created my model using Word2Vec, but the results were not good, so I want to add more words to it.
The first time, I created the model with the code below. Creating a model works, but I cannot add to it.
Please tell me how I can add words.
createModel.py
token = loadCsv("test_data")
embeddingmodel = []
for i in range(len(token)):
    temp_embeddingmodel = []
    for k in range(len(token[i][0])):
        temp_embeddingmodel.append(token[i][0][k])
    embeddingmodel.append(temp_embeddingmodel)
embedding = Word2Vec(embeddingmodel, size=300, window=5, min_count=3,
                     iter=100, sg=1, workers=4, max_vocab_size=360000000)
embedding.save('post.embedding')
loadWord2Vec.py
tokens = W2V.tokenize(sentence)
embedding = Convert2Vec('Data/post.embedding', tokens)
zero_pad = W2V.Zero_padding(embedding, Batch_size, Maxseq_length, Vector_size)
Please tell me how to add to, or merge, the results of two Word2Vec models.
There's no easy way to merge two Word2Vec models.
Only word-vectors that were trained together are "in the same space" and thus comparable.
The best policy would be to combine the two training corpora of texts, and train a new model on the combined data, thus obtaining word-vectors for all words from the same training session.
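For example, a minimal sketch of that approach, reusing the parameters from the question's createModel.py (corpus_a and corpus_b are hypothetical lists of token lists, one per dataset):

from gensim.models import Word2Vec

combined_corpus = corpus_a + corpus_b  # every text from both datasets
embedding = Word2Vec(combined_corpus, size=300, window=5, min_count=3,
                     iter=100, sg=1, workers=4)
embedding.save('post.embedding')  # one model, one shared vector space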

Suspiciously high accuracy in sentiment analysis model

I am building a sentiment analysis model using NLTK and scikit-learn. I have decided to test a few different classifiers in order to see which is most accurate, and eventually use all of them as a means of producing a confidence score.
The datasets used for this testing were all reviews, labelled as either positive or negative.
I trained each classifier with 5,000 reviews, 5 separate times, with 6 different (but very similar) datasets. Each test was done with a new set of 5,000 reviews.
I averaged the accuracy for each test and dataset, to arrive at an overall mean accuracy. Take a look:
Multinomial Naive Bayes: 91.291%
Logistic Regression: 96.103%
SVC: 95.844%
In some tests, the accuracy was as high as 99.912%. In fact, the lowest mean accuracy for one of the datasets was 81.524%.
Here's a relevant code snippet:
def get_features(comment, word_features):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))
    return features

def main(dataset_name, column, limit):
    data = get_data(column, limit)
    data = clean_data(data)  # filter stop words
    all_words = [w.lower() for (comment, category) in data for w in comment]
    word_features = nltk.FreqDist(all_words).keys()
    feature_set = [(get_features(comment, word_features), category)
                   for (comment, category) in data]
    run = 0
    while run < 5:
        random.shuffle(feature_set)
        training_set = feature_set[:int(len(data) / 2.)]
        testing_set = feature_set[int(len(data) / 2.):]
        classifier = SklearnClassifier(SVC())
        classifier.train(training_set)
        acc = nltk.classify.accuracy(classifier, testing_set) * 100.
        save_acc(acc)  # function to save results as .csv
        run += 1
Although I know that these kinds of classifiers can typically return great results, this seems a little too good to be true.
What are some things that I need to check to be sure this is valid?
It's not such a good sign if you get a range from 99.66% to 81.5%.
To analyze a dataset for text classification, you can check:
Is the dataset balanced?
The distribution of words for each label; sometimes the vocabulary used for each label can be really different.
Are the positive and negative reviews from the same source? As in the previous point, if the domains are not the same, the reviews can use different expressions for a positive or negative review, and that can inflate accuracy across several sources.
Try it with reviews from a different source.
If after all that you still get high accuracy, congrats! Your get_features is really good. :)
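For example, a quick balance check (assuming data is the list of (comment, category) pairs from the question's code):

from collections import Counter

label_counts = Counter(category for (comment, category) in data)
print(label_counts)  # a heavy skew toward one label can inflate accuracy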

How can I improve the cosine similarity of two documents(sentences) in doc2vec model?

I am building an NLP chat application in Python with the gensim library, using its doc2vec model. I have hard-coded documents and, given a set of training examples, I am testing the model by throwing in a user question and then finding the most similar documents as a first step. In this case my test question is an exact copy of a document from the training examples.
import gensim
from gensim import models

sentence = models.doc2vec.LabeledSentence(
    words=[u'sampling', u'what', u'is', u'tell', u'me', u'about'], tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(
    words=[u'eligibility', u'what', u'is', u'my', u'limit', u'how', u'much', u'can', u'I', u'claim'],
    tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(
    words=[u'eligibility', u'I', u'am', u'retiring', u'how', u'much', u'can', u'claim', u'have', u'resigned'],
    tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(
    words=[u'what', u'is', u'my', u'eligibility', u'post', u'my', u'promotion'], tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(
    words=[u'what', u'is', u'my', u'eligibility', u'post', u'my', u'promotion'], tags=["SENT_4"])
sentences = [sentence, sentence1, sentence2, sentence3, sentence4]

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for uid, line in enumerate(open(filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])

model = models.Doc2Vec(alpha=0.03, min_alpha=.025, min_count=2)
model.build_vocab(sentences)
for epoch in range(30):
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
model.save("my_model.doc2vec")
model_loaded = models.Doc2Vec.load('my_model.doc2vec')
print(model_loaded.docvecs.most_similar(["SENT_4"]))
Result:
[('SENT_1', 0.043695494532585144), ('SENT_2', 0.0017897281795740128), ('SENT_0', -0.018954679369926453), ('SENT_3', -0.08253869414329529)]
The similarity of SENT_4 and SENT_3 is only -0.08253869414329529, when it should be 1 since they are exactly the same. How should I improve this accuracy? Is there a specific way of training documents, and am I missing something?
Word2Vec/Doc2Vec don't work well on toy-sized examples (such as few texts, short texts, and few total words). Many of the desirable properties are only reliably achieved with training sets of millions of words, or tens-of-thousands of documents.
In particular, with only 5 examples, and only a dozen or two words, but 100-dimensions of modeling vectors, the training isn't forced to do the main thing which makes word-vectors/doc-vectors useful: compress representations into dense embeddings, where similar items need to be incrementally nudged near each other in vector space, because there's no way to retain all the original variation in a sort-of-giant-lookup-table. With more dimensions than corpus variation, your identical-tokens SENT_3 and SENT_4 can adopt wildly different doc-vectors, and the model is still large enough to do great on its training task (essentially, 'overfit'), without the desired end-state of similar-texts having similar-vectors being forced.
You can sometimes squeeze a little more meaning out of small datasets with more training iterations, and a much-smaller model (in terms of vector size), but really: these vectors need large, varied datasets to become meaningful.
That's the main issue. Some other inefficiencies or errors in your example code:
Your code doesn't use the class LabeledLineSentence, so there's no need to include it here – it's irrelevant boilerplate. (Also, TaggedDocument is the preferred name for the words+tags document class in recent gensim versions, rather than LabeledSentence.)
Your custom-management of alpha and min_alpha is unlikely to do anything useful. These are best left at their defaults unless you already have something working, understand the algorithm well, and then want to try subtle optimizations.
train() will do its own iterations, so you don't need to call it many times in an outer loop. (As written, this code's first loop does 5 (model.iter) iterations at alpha values gradually descending from 0.03 to 0.025, then 5 iterations at a fixed alpha of 0.028, then 5 more at 0.026, and so on through 27 more sets of 5 iterations at decreasing alpha, ending on the 30th loop at a fixed alpha of -0.028. That's a nonsense ending value – the learning rate should never be negative – at the end of a nonsense progression. Even with a big dataset, these 150 iterations, about half of them happening at negative alpha values, would likely yield weird results.)
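For example, a minimal sketch of the simpler recommended pattern, keeping the question's tiny corpus for illustration (the vector_size and epochs values here are arbitrary small choices, not tuned recommendations):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# reuse the question's texts, but as TaggedDocument objects
docs = [TaggedDocument(words=s.words, tags=s.tags) for s in sentences]

# leave alpha/min_alpha at their defaults; call train() exactly once
model = Doc2Vec(vector_size=50, min_count=2, epochs=30)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)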
