I'm evaluating Doc2Vec for a recommender API. I wasn't able to find a decent pre-trained model, so I trained a model on the corpus, which is about 8,000 small documents.
model = Doc2Vec(vector_size=25,
                alpha=0.025,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
I then looped through the corpus to find similar documents for each document. Results were not very good (in comparison to TF-IDF). Note this is after testing different epochs and vector sizes.
inferred_vector = model.infer_vector(row['cleaned'].split())
sims = model.docvecs.most_similar([inferred_vector], topn=4)
I also tried extracting the trained vectors and using cosine_similarity, but the results were oddly even worse.
cosine_similarities = cosine_similarity(model.docvecs.vectors_docs, model.docvecs.vectors_docs)
Am I doing something wrong or is the smallish corpus the problem?
Edit:
Prep and training code
def unesc(s):
    for idx, row in s.iteritems():
        s[idx] = html.unescape(row)
    return s

custom_pipeline = [
    preprocessing.lowercase,
    unesc,
    preprocessing.remove_urls,
    preprocessing.remove_html_tags,
    preprocessing.remove_diacritics,
    preprocessing.remove_digits,
    preprocessing.remove_punctuation,
    lambda s: hero.remove_stopwords(s, stopwords=custom_stopwords),
    preprocessing.remove_whitespace,
    preprocessing.tokenize
]

ds['cleaned'] = ds['body'].pipe(hero.clean, pipeline=custom_pipeline)
w2v_total_data = list(ds['cleaned'])
tag_data = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(w2v_total_data)]

model = Doc2Vec(vector_size=25,
                alpha=0.025,
                min_alpha=0.00025,
                min_count=1,
                epochs=20,
                dm=1)
model.build_vocab(tag_data)
model.train(tag_data, total_examples=model.corpus_count, epochs=model.epochs)
model.save("lmdocs_d2v.model")
Without seeing your training code, there could easily be errors in text prep & training. Many online code examples are bonkers wrong in their Doc2Vec training technique!
Note that min_count=1 is essentially always a bad idea with this sort of algorithm: any example suggesting that was likely from a misguided author.
Is a mere .split() also the only tokenization applied for training? (The inference list-of-tokens should be prepped the same as the training lists-of-tokens.)
How was "not very good" and "oddly even worse" evaluated? For example, did the results seem arbitrary, or in-the-right-direction-but-just-weak?
"8,000 small documents" is a bit on the thin side for a training corpus, but it somewhat depends on "how small" – a few words, a sentence, a few sentences? Moving to smaller vectors, or more training epochs, can sometimes make the best of a smallish training set - but this sort of algorithm works best with lots of data, such that dense 100d-or-more vectors can be trained.
Related
Currently I'm training my Word2Vec + LSTM for Twitter sentiment analysis. I use the pre-trained GoogleNewsVectorNegative300 word embedding. The reason I used the pre-trained GoogleNewsVectorNegative300 is that performance was much worse when I trained my own Word2Vec on my own dataset. The problem is that my training process has validation accuracy and loss stuck at 0.88 and 0.34 respectively. Then, my confusion matrix also seems wrong. Here are several steps that I have done before fitting the model:
Text pre-processing:
Lower casing
Remove hashtag, mentions, URLs, numbers, change words to numbers, non-ASCII characters, retweets "RT"
Expand contractions
Replace negations with antonyms
Remove punctuation
Remove stopwords
Lemmatization
I split my dataset into 90:10 for train:test as follows:
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                         y,
                                                         train_size=0.9,
                                                         test_size=0.1,
                                                         stratify=y,
                                                         random_state=0)
    return X_train, X_test, y_train, y_test
The split resulted in 2060 training samples: 708 in the positive sentiment class, 837 in the negative sentiment class, and 515 in the neutral sentiment class.
Then, I implemented the text augmentation that is EDA (Easy Data Augmentation) on all the training data as follows:
class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()

    def replace_synonym(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced

    def random_insert(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted

    def random_swap(self, text):
        augmented_text_portion = int(len(text) * 0.1)
        random_swaped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swaped

    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted

text_augmentation = TextAugmentation()
The data augmentation resulted in 10300 training samples: 3540 in the positive sentiment class, 4185 in the negative sentiment class, and 2575 in the neutral sentiment class.
Then, I tokenized the sequence as follows:
# Tokenize the sequence
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)
X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)
# Pad the sequence
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)
pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1
# Encode label
y_pfizer_train_encoded = df_pfizer_train['sentiment'].factorize()[0]
y_pfizer_test_encoded = df_pfizer_test['sentiment'].factorize()[0]
y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)
This results in 8869 unique words and a maximum sequence length of 100.
Finally, I fit the data into my model, using the pre-trained GoogleNewsVectorNegative300 word embedding only for the embedding weights followed by the LSTM, and I split my training data again with 10% for validation, as follows:
# Build single LSTM model
def build_lstm_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,), dtype='int32')

    # Word embedding layer
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)

    # LSTM model layer
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=True)(embedding_layer)
    batch_normalization = BatchNormalization()(lstm_layer)
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=False)(batch_normalization)
    batch_normalization = BatchNormalization()(lstm_layer)

    # Dense model layer
    dense_layer = Dense(units=128, activation='relu')(batch_normalization)
    dropout_layer = Dropout(rate=0.5)(dense_layer)
    batch_normalization = BatchNormalization()(dropout_layer)
    output_layer = Dense(units=3, activation='softmax')(batch_normalization)

    lstm_model = Model(inputs=input_layer, outputs=output_layer)
    return lstm_model

# Building single LSTM model
sinovac_lstm_model = build_lstm_model(SINOVAC_EMBEDDING_MATRIX, SINOVAC_MAX_SEQUENCE)
sinovac_lstm_model.summary()
sinovac_lstm_model.compile(loss='categorical_crossentropy',
                           optimizer=Adam(learning_rate=0.001),
                           metrics=['accuracy'])
sinovac_lstm_history = sinovac_lstm_model.fit(x=X_sinovac_train,
                                              y=y_sinovac_train,
                                              batch_size=64,
                                              epochs=20,
                                              validation_split=0.1,
                                              verbose=1)
The training result:
The evaluation result:
I really need some suggestions or insights to get good accuracy on my test set.
Without reviewing everything, a few high-order things that may be limiting your results:
The GoogleNews vectors were trained on media-outlet news stories from 2012 and earlier. Tweets in 2020+ use a very different style of language. I wouldn't necessarily expect those pretrained vectors, from a different era & domain-of-writing, to be very good at modeling the words you'll need. A well-trained word2vec model (using plenty of modern tweet data, with good preprocessing/tokenization & parameterization choices) has a good chance of working better, so you may want to revisit that choice.
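If you do try your own vectors again, a rough sketch of the idea (assuming tweet_token_lists is the list of token lists produced by the same preprocessing as the rest of your pipeline, and reusing your pfizer_tokenizer; gensim 4 parameter names, older versions use size= and iter=) might be:
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec(sentences=tweet_token_lists,
               vector_size=100,   # smaller than 300d is often enough for tweet-sized data
               window=5,
               min_count=2,
               workers=4,
               epochs=10)

# Build an embedding matrix aligned with the Keras tokenizer's indices
embedding_dim = w2v.wv.vector_size
embedding_matrix = np.zeros((len(pfizer_tokenizer.word_index) + 1, embedding_dim))
for word, idx in pfizer_tokenizer.word_index.items():
    if word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]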
The GoogleNews training texts' preprocessing, while as far as I can tell never fully documented, did not appear to flatten all casing, nor remove stopwords, nor involve lemmatization. It didn't mutate obvious negations into antonyms, but it did perform a statistical combination of some single words into multigram tokens. So some of your steps are potentially causing your tokens to have less concordance with that set's vectors – even throwing away info, like inflectional variations of words, that could be beneficially retained. Be sure every step you're taking is worth the trouble – and note that a sufficient modern word2vec model, trained on tweets using the same preprocessing for word2vec training and for later steps, would match vocabularies perfectly.
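One quick way to gauge how much this mismatch matters is to measure what fraction of your tokenizer's vocabulary is actually covered by the GoogleNews vectors. A sketch, with the file path assumed (key_to_index is the gensim 4.x attribute; older versions expose .vocab instead):
from gensim.models import KeyedVectors

gn = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
vocab = list(pfizer_tokenizer.word_index)  # words seen in your tweets
covered = sum(1 for w in vocab if w in gn.key_to_index)
print('%d of %d tweet words have a GoogleNews vector' % (covered, len(vocab)))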
Both the word2vec model, and any deeper neural network, often need lots of data to train well, and avoid overfitting. Even disregarding the 900 million parameters from GoogleNews, you're trying to train ~130k parameters – at least 520KB of state – from an initial set of merely 2060 tweet-sized texts (maybe 100KB of data). Models that learn generalizable things tend to be compressions of the data, in some sense, and a model that's much larger than the training data brings risk of severe overfitting. (Your mechanistic process for replacing words with synonyms may not be really giving the model any info that the word-vector similarity between synonyms didn't already provide.) So: consider shrinking your model, and getting much more training data - potentially even from other domains than your main classification interest, as long as the use-of-language is similar.
I created my model using Word2Vec, but the results were not good, so I want to add more words. Below is the code I used the first time. Creating the model is possible, but I cannot add to it. Please tell me how I can add words.
createModel.py
token = loadCsv("test_data")
embeddingmodel = []
for i in range(len(token)):
    temp_embeddingmodel = []
    for k in range(len(token[i][0])):
        temp_embeddingmodel.append(token[i][0][k])
    embeddingmodel.append(temp_embeddingmodel)
embedding = Word2Vec(embeddingmodel, size=300, window=5, min_count=3, iter=100, sg=1,workers=4, max_vocab_size = 360000000)
embedding.save('post.embedding')
loadWord2Vec.py
tokens = W2V.tokenize(sentence)
embedding = Convert2Vec('Data/post.embedding', tokens)
zero_pad = W2V.Zero_padding(embedding, Batch_size, Maxseq_length, Vector_size)
Tell me how to add or merge the results of Word2Vec
There's no easy way to merge two Word2Vec models.
Only word-vectors that were trained together are "in the same space" and thus comparable.
The best policy would be to combine the two training corpuses of texts, and train a new model on the combined data, thus obtaining word-vectors for all words from the same training session.
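A minimal sketch of that approach, assuming corpus_a and corpus_b are your two lists of token lists and mirroring the parameters from your createModel.py (gensim 4 parameter names; older gensim used size= and iter=):
from gensim.models import Word2Vec

combined_corpus = corpus_a + corpus_b  # all texts together
embedding = Word2Vec(combined_corpus, vector_size=300, window=5, min_count=3,
                     sg=1, epochs=100, workers=4)
embedding.save('post.embedding')  # every word-vector now shares one space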
I am building a sentiment analysis model using NLTK and scikit-learn. I have decided to test a few different classifiers in order to see which is most accurate, and eventually use all of them as a means of producing a confidence score.
The datasets used for this testing were all reviews, labelled as either positive or negative.
I trained each classifier with 5,000 reviews, 5 separate times, with 6 different (but very similar) datasets. Each test was done with a new set of 5000 reviews.
I averaged the accuracy for each test and dataset, to arrive at an overall mean accuracy. Take a look:
Multinomial Naive Bayes: 91.291%
Logistic Regression: 96.103%
SVC: 95.844%
In some tests, the accuracy was as high as 99.912%. In fact, the lowest mean accuracy for one of the datasets was 81.524%.
Here's a relevant code snippet:
def get_features(comment, word_features):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))
    return features

def main(dataset_name, column, limit):
    data = get_data(column, limit)
    data = clean_data(data)  # filter stop words
    all_words = [w.lower() for (comment, category) in data for w in comment]
    word_features = nltk.FreqDist(all_words).keys()
    feature_set = [(get_features(comment, word_features), category)
                   for (comment, category) in data]

    run = 0
    while run < 5:
        random.shuffle(feature_set)

        training_set = feature_set[:int(len(data) / 2.)]
        testing_set = feature_set[int(len(data) / 2.):]

        classifier = SklearnClassifier(SVC())
        classifier.train(training_set)
        acc = nltk.classify.accuracy(classifier, testing_set) * 100.
        save_acc(acc)  # function to save results as .csv
        run += 1
Although I know that these kinds of classifiers can typically return great results, this seems a little too good to be true.
What are some things that I need to check to be sure this is valid?
It's not so good if you get a range from 99.66% to 81.5%.
To analyze a dataset for text classification, you can check:
Is the dataset balanced?
The distribution of words for each label – sometimes the vocabulary used for each label can be really different.
Positive/negative, but from the same source? As with the previous point, if the domain is not the same, the reviews can use different expressions for a positive or negative review. Checking this helps you get high accuracy across several sources.
Try with reviews from a different source.
If after all that you still get such high accuracy, congrats! Your get_features is really good. :)
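For the first two checks, a small sketch (assuming data is the same list of (comment, category) pairs used in main above):
from collections import Counter

# Label balance – a heavy skew can inflate accuracy
label_counts = Counter(category for _, category in data)
print(label_counts)

# Per-label vocabulary – wildly different top words per label is a hint
per_label_vocab = {}
for comment, category in data:
    per_label_vocab.setdefault(category, Counter()).update(w.lower() for w in comment)
for label, counts in per_label_vocab.items():
    print(label, counts.most_common(15))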
I am building an NLP chat application in Python using the gensim library through its doc2vec model. I have hard-coded documents and, given a set of training examples, I am testing the model by throwing a user question at it and then finding the most similar documents as a first step. In this case my test question is an exact copy of a document from the training examples.
import gensim
from gensim import models
sentence = models.doc2vec.LabeledSentence(words=[u'sampling',u'what',u'is',u'tell',u'me',u'about'],tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'eligibility',u'what',u'is',u'my',u'limit',u'how',u'much',u'can',u'I',u'claim'],tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'eligibility',u'I',u'am',u'retiring',u'how',u'much',u'can',u'claim',u'have', u'resigned'],tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'what',u'is',u'my',u'eligibility',u'post',u'my',u'promotion'],tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'what',u'is', u'my',u'eligibility', u'post',u'my',u'promotion'], tags=["SENT_4"])
sentences = [sentence, sentence1, sentence2, sentence3, sentence4]
class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for uid, line in enumerate(open(self.filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])
model = models.Doc2Vec(alpha=0.03, min_alpha=.025, min_count=2)
model.build_vocab(sentences)
for epoch in range(30):
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
model.save("my_model.doc2vec")
model_loaded = models.Doc2Vec.load('my_model.doc2vec')
print (model_loaded.docvecs.most_similar(["SENT_4"]))
Result:
[('SENT_1', 0.043695494532585144), ('SENT_2', 0.0017897281795740128), ('SENT_0', -0.018954679369926453), ('SENT_3', -0.08253869414329529)]
The similarity of SENT_4 and SENT_3 is only -0.08253869414329529 when it should be 1, since they are exactly the same. How should I improve this accuracy? Is there a specific way of training documents, and am I missing something?
Word2Vec/Doc2Vec don't work well on toy-sized examples (such as few texts, short texts, and few total words). Many of the desirable properties are only reliably achieved with training sets of millions of words, or tens-of-thousands of documents.
In particular, with only 5 examples, and only a dozen or two words, but 100-dimensions of modeling vectors, the training isn't forced to do the main thing which makes word-vectors/doc-vectors useful: compress representations into dense embeddings, where similar items need to be incrementally nudged near each other in vector space, because there's no way to retain all the original variation in a sort-of-giant-lookup-table. With more dimensions than corpus variation, your identical-tokens SENT_3 and SENT_4 can adopt wildly different doc-vectors, and the model is still large enough to do great on its training task (essentially, 'overfit'), without the desired end-state of similar-texts having similar-vectors being forced.
You can sometimes squeeze a little more meaning out of small datasets with more training iterations, and a much-smaller model (in terms of vector size), but really: these vectors need large, varied datasets to become meaningful.
That's the main issue. Some other inefficiencies or errors in your example code:
Your code doesn't use the class LabeledLineSentence, so there's no need to include it here – it's irrelevant boilerplate. (Also, TaggedDocument is the preferred name for the words+tags document class in recent gensim versions, rather than LabeledSentence.)
Your custom-management of alpha and min_alpha is unlikely to do anything useful. These are best left at their defaults unless you already have something working, understand the algorithm well, and then want to try subtle optimizations.
train() will do its own iterations, so you don't need to call it many times in an outer loop. (As written, this code does, in its first loop, 5 (model.iter) iterations at alpha values gradually descending from 0.03 to 0.025, then 5 iterations at a fixed alpha of 0.028, then 5 more at 0.026, then 27 more sets of 5 iterations at decreasing alpha, ending on the 30th loop at a fixed alpha of -0.028. That's a nonsense ending value – the learning rate should never be negative – at the end of a nonsense progression. Even with a big dataset, these 150 iterations, about half happening at negative alpha values, would likely yield weird results.)
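For contrast, a minimal sketch of the simpler pattern these points suggest (current gensim class names; with a corpus this tiny the results will still be weak):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=[u'sampling', u'what', u'is', u'tell', u'me', u'about'], tags=['SENT_0']),
        # ... the other four documents, built the same way ...
        ]

model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(docs)
# A single train() call handles all epochs and the alpha decay internally
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
model.save("my_model.doc2vec")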
I am training my own word2vec model on Gensim in python, on a relatively small dataset. The data consist of about 3000 short-text entries from different people, most of which are two or three sentences. I know this is small for a word2vec dataset, but I've seen similar ones work in the past.
For some reason, when I train my model all of the features are impractically close to one another. For instance:
model.most_similar('jesus/NN')
[(u'person/NN', 0.9999418258666992),
(u'used/VB', 0.9998890161514282),
(u'so/RB', 0.9998359680175781),
(u'question/NN', 0.9997845888137817),
(u'just/RB', 0.9996646642684937),
(u'other/NN', 0.9995589256286621),
(u'allow/VB', 0.9995476603507996),
(u'feel/VB', 0.9995381236076355),
(u'attend/VB', 0.9995047450065613),
(u'make/VB', 0.9994802474975586)]
The parts of speech are included because I lemmatize the data.
Here is my training code:
cleanedResponses = []

# For every response
for rawValue in df['QB17_W6'].values:
    lemValue = lemmatize(rawValue)
    cleanedResponses.append(lemValue)

df['cleaned_responses'] = cleanedResponses

bigram_transformer = Phrases(df.cleaned_responses.values)
model = Word2Vec(bigram_transformer[df.cleaned_responses.values], size=5)
This also happens when I train without the bigram transformer. Does anybody have an idea as to why the distances are so close?