I'm trying to cluster some descriptions using LSI. Because my dataset is too large, I'm clustering on the vectors obtained from the model instead of on a similarity matrix, which requires too much memory; and if I pick a sample, the resulting matrix isn't square, which precludes the use of MDS.
However, after running the model and retrieving the vectors, I'm getting different vector lengths for the descriptions. Most have a length of 300 (the num_topics argument of the model), but a few, even with the same description, have a length of 299.
Why is this happening? Is there a way to correct it?
dictionary = gensim.corpora.Dictionary(totalvocab_lemmatized)
dictionary.compactify()
corpus = [dictionary.doc2bow(text) for text in totalvocab_lemmatized]
###tfidf model
tfidf = gensim.models.TfidfModel(corpus, normalize = True)
corpus_tfidf = tfidf[corpus]
###LSI model
lsi = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
vectors = []
for n in lemmatized[:100]:
    vec_bow = dictionary.doc2bow(n)
    vec_lsi = lsi[vec_bow]
    print(len(vec_lsi))
Gensim returns LSI vectors in sparse (id, value) form, and explicit zeros are omitted, which is why some vectors appear shorter than num_topics.
Source: https://github.com/RaRe-Technologies/gensim/issues/2501
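If you need fixed-length vectors for clustering, a minimal fix (a sketch using gensim's matutils.sparse2full helper) is to pad each sparse LSI vector back out to num_topics entries:
from gensim import matutils

for n in lemmatized[:100]:
    vec_lsi = lsi[dictionary.doc2bow(n)]
    # expand the sparse output to a dense array of exactly num_topics (300) values
    dense_vec = matutils.sparse2full(vec_lsi, lsi.num_topics)
    print(len(dense_vec))  # always 300, explicit zeros included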
I've got a Keras model built with the functional API, trained on samples with an Embedding layer that embeds a Class feature of 10 unique values into a 4-dimensional vector.
After I load the embedding weights
embeddings_pclass = model.get_layer('Class').get_weights()
I get a list of 10 vectors, say [.3, -.1, .9, .1] and so on. This is all fine and as planned. But I can't find anywhere how to do the reverse engineering to find out which Class matches which vector. My classes' unique values are 1, 2, 3, 5, 7, 8, 9, 12, 15, 19. Should I expect the rows to match the unique values in ascending order?
Thanks
E
There is no explicit function in tf.keras.layers.Embedding to recover the class for a given vector, but you can view it as the problem of finding the nearest embedding in the weight matrix. The weight matrix rows correspond to the (integer-encoded) classes, as you correctly guessed, so you can use argmin over the distances to get the row index of the class.
import tensorflow as tf

def get_class(emb_layer, encoded):
    w = emb_layer.get_weights()[0]                                # (num_classes, dim)
    distances = tf.math.reduce_euclidean_norm(w - encoded, axis=1)
    return tf.math.argmin(distances).numpy()

get_class(model.get_layer('Class'), tf.constant([.3, -.1, .9, .1]))
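If (and only if) you integer-encoded the classes in ascending order before training, you can map the returned row index back to the original label; a hypothetical follow-up using the class values from the question:
# Hypothetical: assumes classes were encoded 1 -> row 0, 2 -> row 1, ..., 19 -> row 9
class_values = [1, 2, 3, 5, 7, 8, 9, 12, 15, 19]
row = get_class(model.get_layer('Class'), tf.constant([.3, -.1, .9, .1]))
print(class_values[row])  # the original class label nearest to the query vector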
I have a movie review data set with two columns: Review (sentences) and Sentiment (1 or 0).
I want to create a classification model using word2vec for the embedding and a CNN for the classification.
I've looked for tutorials on YouTube, but all they do is create vectors for every word and show me similar words, like this:
model= gensim.models.Word2Vec(cleaned_dataset, min_count = 2, size = 100, window = 5)
words= model.wv.vocab
similar = model.wv.most_similar("bad")
I already have my dependent variable (y), which is my 'Sentiment' column; all I need is the independent variable (X) that I can pass to my CNN model.
Before using word2vec I used the Bag Of Words(BOW) model which generated a sparse matrix which was my independent(X) variable. How can I achieve something similar using word2vec?
Kindly correct me if I'm doing something wrong.
To get the word vector, you have to do this:
model.wv['word_that_you_want']
You may also want to handle the KeyError that could arise if you don't find that given word in your model. You also might want to read about what an embedding layer is, which is usually used as the first layer of the neural network (for NLP generally) and is basically a lookup mapping of a word to its corresponding word vector.
To get the word vectors for an entire sentence, you need to first initialize a numpy array of zeros to the dimensions you want.
You might need other variables such as the length of the longest sentence, so that you can pad all sentences to that length; see the Keras documentation for the pad_sequences utility.
A simple example of getting a sentence of word vectors is:
import numpy as np
embedding_matrix = np.zeros((vocab_len, size_of_your_word_vector))
Then iterate over the index of embedding_matrix and add to it, if you find a word vector in your model.
I use a resource with a lot of examples, and I have referenced (and used) some of its code:
embedding_matrix = np.zeros((vocab_length, 100))
for word, index in word_tokenizer.word_index.items():
    if word in model.wv:  # skip words the w2v model doesn't know (avoids a KeyError)
        embedding_matrix[index] = model.wv[word]
And in your model (I'm assuming TensorFlow with Keras):
embedding_layer = Embedding(vocab_length, 100, weights=[embedding_matrix], input_length=length_long_sentence, trainable=False)
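As a rough sketch (the layer sizes and names here are assumptions, not part of the original answer), the frozen embedding layer can then feed a small 1-D CNN for the binary Sentiment target:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense

cnn = Sequential([
    embedding_layer,                                # frozen word2vec embeddings from above
    Conv1D(64, kernel_size=5, activation='relu'),   # learn local n-gram features
    GlobalMaxPooling1D(),                           # collapse to one feature vector per review
    Dense(1, activation='sigmoid'),                 # binary sentiment output
])
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])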
I hope this helps.
Word2Vec doesn't inherently create vectors for a text (set of words) – just individual words.
But, sometimes a not-so-bad vector for a multi-word text is the average of all its word-vectors.
If list_of_words is a list of the words in your text, and all the words are in the Word2Vec model, a simple way to get the average of those words' vectors is:
avg_vector_of_words = model.wv[list_of_words].mean(axis=0)
(If some words aren't present, you'd need to filter them before attempting this to avoid KeyErrors. If you wanted to leave out some words, or use unit-normed word-vectors, or unit-normalize the final vector, you'd need more code.)
Then avg_vector_of_words is a small, dense/'embedded' feature vector for the list_of_words text.
You could pass these vectors, one per text, to another downstream classifier, like your CNN, exactly analogously to how you were previously using sparse BOW vectors.
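As a rough sketch (assuming cleaned_dataset is the list of tokenized reviews you trained Word2Vec on, and model is that trained model), you could build the X matrix like this:
import numpy as np

X = []
for tokens in cleaned_dataset:
    known = [w for w in tokens if w in model.wv]  # skip words missing from the model
    if known:
        X.append(model.wv[known].mean(axis=0))
    else:
        X.append(np.zeros(model.wv.vector_size))  # review with no in-vocabulary words
X = np.array(X)  # shape: (num_reviews, 100), ready to use as the independent variable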
I need to multiply the weights of the terms in the TFIDF matrix by the word embeddings of the word2vec matrix, but I can't, because each matrix has a different number of terms. I am using the same corpus to build both matrices, and I don't know why they end up with different numbers of terms.
My problem is that I have a TFIDF matrix with shape (56096, 15500) (number of terms, number of documents) and a Word2vec matrix with shape (300, 56184) (word-embedding dimensionality, number of terms).
And I need the same number of terms in both matrices.
I use this code to get the Word2vec embedding matrix:
def w2vec_gensim(norm_corpus):
    wpt = nltk.WordPunctTokenizer()
    tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]
    # Set values for various parameters
    feature_size = 300    # Word vector dimensionality
    window_context = 10   # Context window size
    min_word_count = 1    # Minimum word count
    sample = 1e-3         # Downsample setting for frequent words
    w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size,
                                  window=window_context, min_count=min_word_count,
                                  sample=sample, iter=100)
    words = list(w2v_model.wv.vocab)
    vectors = []
    for w in words:
        vectors.append(w2v_model.wv[w].tolist())
    embedding_matrix = np.array(vectors)
    embedding_matrix = embedding_matrix.T
    print(embedding_matrix.shape)
    return embedding_matrix
And this code to get the TFIDF matrix:
tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True, smooth_idf=True)

def matriz_tf_idf(datos, tv):
    tv_matrix = tv.fit_transform(datos)
    tv_matrix = tv_matrix.toarray()
    tv_matrix = tv_matrix.T
    return tv_matrix
And I need the same number of terms in each matrix. For example, if I have 56096 terms in the TFIDF matrix, I need the same number in the embedding matrix, i.e. a TFIDF matrix of shape (56096, 15500) and a Word2vec embedding matrix of shape (300, 56096). How can I get the same number of terms in both matrices? I can't simply delete terms, because the multiplication has to make sense; my goal is to get embeddings for the documents.
Thank you very much in advance.
The problem is that the TfidfVectorizer is cutting out around 90 terms, because by default it tokenizes differently from the WordPunctTokenizer used for Word2Vec. Passing the same tokenizer is the solution:
wpt = nltk.WordPunctTokenizer()
tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True,
                     smooth_idf=True, tokenizer=wpt.tokenize)
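Once both use the same tokenizer, one remaining pitfall is term order: TfidfVectorizer and Word2Vec keep their vocabularies in different orders, so align them before multiplying. A rough sketch, assuming you keep the fitted tv and w2v_model objects around (the helper above only returns the matrix) and that norm_corpus is the shared corpus:
import numpy as np

tfidf_matrix = matriz_tf_idf(norm_corpus, tv)        # shape (n_terms, n_docs)
terms = tv.get_feature_names_out()                   # tv.get_feature_names() on older scikit-learn
emb = np.array([w2v_model.wv[t] for t in terms]).T   # shape (300, n_terms), in TFIDF term order
doc_embeddings = emb @ tfidf_matrix                  # shape (300, n_docs): one embedding per document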
I am building a multilabel text classification program and I am trying to use OneVsRestClassifier+XGBClassifier to classify the text. Initially I used Sklearn's Tf-Idf Vectorization to vectorize the texts, which worked without error. Now I am using Gensim's Word2Vec to vectorize the texts. When I feed the vectorized data into the OneVsRestClassifier+XGBClassifier however, I get the following error on the line where I split the test and training data:
TypeError: Singleton array array(, dtype=object) cannot be considered a valid collection.
I have tried converting the vectorized data into a feature array (np.array), but that hasn't seemed to work.
Below is my code:
x = np.array(Word2Vec(textList, size=120, window=6, min_count=5, workers=7, iter=15))
vectorizer2 = MultiLabelBinarizer()
vectorizer2.fit(tagList)
y = vectorizer2.transform(tagList)
# Split test data and convert test data to arrays
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.20)
The variables textList and tagList are a list of strings (textual descriptions I am trying to classify).
x here becomes a numpy array conversion of the gensim.models.word2vec.Word2Vec object -- it is not actually the word2vec representations of textList that are returned.
Presumably, what you want to return is the corresponding vector for each word in a document (for a single vector representing each document, it would be better to use Doc2Vec).
For a set of documents in which the most verbose document contains n words, each document would then be represented by a matrix of up to n * 120 values, so shorter documents need padding to reach a common shape.
Unoptimized code for illustrative purposes:
import numpy as np

model = Word2Vec(textList, size=120, window=6,
                 min_count=5, workers=7, iter=15)

documents = []
for document in textList:
    word_vectors = []
    for word in document.split(' '):  # or your logic for separating tokens
        if word in model.wv:          # words below min_count=5 have no vector
            word_vectors.append(model.wv[word])
    # one (num_words, 120) -- that is, `Word2Vec:size` -- matrix per document
    documents.append(np.vstack(word_vectors))
# pad these matrices to a common length before stacking (see the sketch below)
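Since each per-document matrix has a different number of rows, here is a rough, hypothetical follow-up that zero-pads every document to the longest one, giving a single array you can hand to train_test_split:
# Zero-pad each (num_words, 120) matrix to the longest document, producing
# one (num_documents, max_len, 120) array.
max_len = max(doc.shape[0] for doc in documents)
x = np.stack([
    np.pad(doc, ((0, max_len - doc.shape[0]), (0, 0)))  # append rows of zeros
    for doc in documents
])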
I am calculating TFIDF with Spark in Python using the following code:
from pyspark.mllib.feature import HashingTF, IDF  # assuming the RDD-based MLlib API

hashingTF = HashingTF()
tf = hashingTF.transform(documents)
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
for k in tfidf.collect():
    print(k)
I got the following results for three documents:
(1048576,[558379],[1.43841036226])
(1048576,[181911,558379,959994], [0.287682072452,0.287682072452,0.287682072452])
(1048576,[181911,959994],[0.287682072452,0.287682072452])
Assuming that I have thousands of documents, how can I link the resulting TFIDF sparse vectors back to the original documents, given that I don't care about reversing the hash keys to the original terms?
Since both documents and tfidf have the same shape (number of partitions, number of elements per partition) and there are no operations which require a shuffle, you can simply zip both RDDs:
documents.zip(tfidf)
Reversing HashingTF is, for an obvious reason, not possible: the hash discards the original terms.
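A minimal usage sketch of the zipped RDD (names as above), keeping a few (document, vector) pairs on the driver:
paired = documents.zip(tfidf)   # RDD of (original document, SparseVector) pairs
for doc, vec in paired.take(3):
    print(doc, vec)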