Currently, fastText produces sentence vectors by averaging the normalized word vectors of the words in the sentence. Is this the best way to come up with a sentence vector?
Or would a TF-IDF weighting of the words, followed by subtracting the first PCA component as discussed in this paper: https://openreview.net/pdf?id=SyK00v5xx, work better? And is there any such implementation already within fastText? If so, where is it and how can it be used from Python?
Also, do I need to remove stop words from a sentence before computing its fastText vector?
Also, how do I compute the sentence vector in the Python bindings of fastText? There seems to be no syntax for that. Any comments?
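(For reference, newer releases of the official fasttext Python bindings do expose a get_sentence_vector method. A minimal sketch, assuming such a build and a trained model file named model.bin, which is just a placeholder path:)

import fasttext
import numpy as np

model = fasttext.load_model("model.bin")  # hypothetical path to a trained model

# built-in sentence vector (average of normalized word vectors)
sv = model.get_sentence_vector("this is a test sentence")

# roughly equivalent manual computation from the individual word vectors
words = "this is a test sentence".split()
vecs = [model.get_word_vector(w) for w in words]
manual = np.mean([v / (np.linalg.norm(v) + 1e-12) for v in vecs], axis=0)

As far as I know, the TF-IDF weighting and the PCA trick from the paper are not implemented inside fastText itself; they would have to be computed on top of the word vectors.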
I've been searching for an answer to this specific question for a few hours and while I've learned a lot, I still haven't figured it out.
I have a dataset of ~70,000 sentences, with a subset of about 4,000 sentences that have been appropriately categorized; the rest are uncategorized. Currently I'm using a scikit-learn pipeline with CountVectorizer and TfidfTransformer to vectorize the data, however I'm only vectorizing based on the 4,000 sentences and then testing various models via cross-validation.
I'm wondering if there is a way to use Word2Vec or something similar to vectorize the entire corpus of data and then use these vectors with my subset of 4,000 sentences. My intention is to increase the accuracy of my model predictions by using word vectors that incorporate all of the semantic data in the corpus rather than just data from the 4,000 sentences.
The code I'm currently using is:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

svc = Pipeline([('vect', CountVectorizer(ngram_range=(3, 5))),
                ('tfidf', TfidfTransformer()),
                ('clf', LinearSVC()),
               ])
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
Where X_train and y_train are my features and labels, respectively. I also have a list z_all which includes all remaining uncategorized features.
Just getting pointed in the right direction (or told whether or not this is possible) would be super helpful.
Thank you!
I would say that the answer is yes: you can use Word2Vec or another similar word-embedding method to get a vector for each sentence in your data, and then use these vectors both as training and testing data in a linear Support Vector Machine (SVC).
And yes, you can first create those vectors for your entire corpus of ~70,000 sentences before actually doing any training on your data.
It is however not as straightforward as the approach you're currently using.
There are many different ways to do this so I'll just go through one of them to help you get the basics of how this can be done.
Before we start and see what possible steps you can follow, let's remember that the goal here is to get one vector for each and every sentence of your corpus.
If you don't know what word embeddings are, I highly suggest you read about them, but in short this is just a way to link each word of a pre-defined vocabulary to a vector of a given dimension.
For instance, you would have:
# the vector associated with the word "cat" is the following vector of fixed-length
word_embeddings["cat"] = [0.0014, 0.6710, ..., 0.3281]
Now that you know this, here are the steps you could be following:
Tokenization - The first thing you want to do is tokenize each of your sentences. This can be done using an NLP library (SpaCy, for instance; see the sketch just after this list) that will help you to:
split each sentence into a list of words
remove any punctuation from these words and convert them to lowercase
remove stopwords - optionally
lemmatize all the words - optionally
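A minimal tokenization sketch with SpaCy, assuming the en_core_web_sm model is installed (stop-word removal and lemmatization are the optional steps from the list):

import spacy

nlp = spacy.load("en_core_web_sm")

def tokenize(sentence):
    doc = nlp(sentence.lower())
    # keep lemmas, drop punctuation and (optionally) stop words
    return [tok.lemma_ for tok in doc if not tok.is_punct and not tok.is_stop]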
Train a word embedding model - Now that you have each sentence as a pre-processed list of words, you need to train a word-embedding model using your corpus. There are many different algorithms to do that. I would suggest using Gensim with Word2Vec or fastText (see the sketch after this list). You could also use pre-trained word embeddings, like GloVe, or anything that best fits your corpus in terms of language/context. Either way, this will allow you to:
have one vector of pre-defined size for each and every word in your corpus' vocabulary
get a list of equally-sized vectors for each sentence in your corpus
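A sketch of this step with Gensim, assuming tokenized_corpus is the list of token lists produced above and gensim 4.x (older versions use size instead of vector_size):

from gensim.models import Word2Vec

# tokenized_corpus: one list of tokens per sentence, for all ~70,000 sentences
w2v = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5,
               min_count=2, workers=4)

print(w2v.wv["cat"])  # one fixed-size vector per word in the learned vocabulary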
Adopting a weighting method - Once you have a list of vectors for each sentence in your corpus, and mainly because your sentences vary in length (some have 6 words, some others have 13 words, etc.), what you want to do is get a single vector for each and every sentence. To do this, you can simply weight the vectors corresponding to the words in each sentence (see the sketch after this list). You can:
average all vectors
use weights like TF-IDF weights to give some words more importance than others
use other weighting methods...
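One possible weighting sketch, reusing the tokenize function and the w2v model from the previous steps; raw_sentences is a placeholder for your full corpus as strings, only the IDF part of TF-IDF is used as a per-word weight, and get_feature_names_out assumes scikit-learn 1.0 or later:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(tokenizer=tokenize, lowercase=False)
tfidf.fit(raw_sentences)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def sentence_vector(tokens):
    # weight each word vector by its IDF, then average; zero vector if nothing matches
    vecs = [w2v.wv[w] * idf.get(w, 1.0) for w in tokens if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)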
Training and testing - Finally, all that's left to do is train a model using these vectors, for instance with a linear Support Vector Machine (SVC), and test the accuracy of your model on a test dataset (you can also use a validation dataset).
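And a final sketch for this step, assuming labelled_sentences and labels are placeholders holding the ~4,000 categorized examples:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X = np.vstack([sentence_vector(tokenize(s)) for s in labelled_sentences])
scores = cross_val_score(LinearSVC(), X, labels, cv=5)
print(scores.mean())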
My opinion is, if you are going to use a word2vec embedding, use a pre-trained one or generate it from generic text.
Word2vec embeddings are usually used to give meaning and context to your text data; if you train an embedding using only your own data, it might be biased and not represent the language, and that means its vectors wouldn't carry much meaning.
Once your embedding is working, you also have to think about what to do with your words, because a sentence has one or more words (the embedding works at word level), and you want to feed your models 1 sentence -> 1 vector, not 1 sentence -> N vectors.
People usually average or multiply those vectors, so for example, for the sentence "Hello there" and an embedding of 5 dimensions:
Hello -> [0, 0, .2, .3, .8]
there -> [.1, .2, 0, 0, .5]
AVG Hello there -> [.05, .1, .1, .15, .65]
This is what you want to use for your models!
So instead of using TF-IDF to generate your sentence vectors, use word2vec like this and you shouldn't have any problem. I already worked on a text classification project and we ended up using a self-trained w2v embedding and an ExtraTrees model, with great results.
I'm interested in using tf-idf with the FastText library, but haven't found a logical way to handle the ngrams. I have already used tf-idf with SpaCy vectors, for which I have found several examples like these:
http://dsgeek.com/2018/02/19/tfidf_vectors.html
https://www.aclweb.org/anthology/P16-1089
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
But for the FastText library it is not that clear to me, since it has a granularity that isn't as intuitive, e.g.:
For a general word2vec approach I will have one vector for each word; I can count the term frequency of that word and weight its vector accordingly.
But with fastText the same word will have several n-grams:
"Listen to the latest news summary" will have n-grams generated by a sliding window, like:
lis ist ste ten tot het...
These n-grams are handled internally by the model so when I try:
model["Listen to the latest news summary"]
I get the final vector directly, hence what I have thought is to split the text into n-grams before feeding them to the model, like:
model['lis']
model['ist']
model['ten']
And compute the tf-idf from there, but that seems like an inefficient approach. Is there a standard way to apply tf-idf to n-gram vectors like these?
I would let FastText deal with the trigrams, but keep building the tfidf-weighted embeddings at the word level.
That is, you send
model["Listen"]
model["to"]
model["the"]
...
to FastText, and then use your old code to get the tf-idf weights.
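A rough sketch of that idea, assuming model is a loaded fastText model from the official Python bindings and tfidf_weights is the {word: weight} dictionary produced by your existing tf-idf code:

import numpy as np

def weighted_sentence_vector(sentence, model, tfidf_weights):
    words = sentence.split()
    # per-word fastText vectors (built internally from char n-grams),
    # scaled by the word-level tf-idf weights
    vecs = [model.get_word_vector(w) * tfidf_weights.get(w, 1.0) for w in words]
    return np.mean(vecs, axis=0)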
In any case, it would be good to know whether FastText itself considers the word construct when processing a sentence, or whether it truly treats it as just a sequence of trigrams (blending consecutive words). If the latter is true, then for FastText you would lose information by breaking the sentence into separate words.
You are talking about the fastText tokenization step (not the fastText embeddings), which is a (3,6) char-n-gram tokenization, compatible with tf-idf. The full step can be computed outside of fastText quite easily; see "Calculate TF-IDF using sklearn for n-grams in python".
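A small sketch of that external computation with scikit-learn; char_wb keeps the n-grams inside word boundaries, which roughly mirrors fastText's (3, 6) subword range:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 6))
X = vec.fit_transform(["Listen to the latest news summary",
                       "Another example sentence"])
print(vec.get_feature_names_out()[:10])  # the char n-grams that got tf-idf scores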
From what I understood of your question, you are confusing word embedding methods (such as word2vec and many others) with Tf-Idf:
Basically, word embedding methods are unsupervised models for generating word vectors. The word vectors generated by this kind of model are now very popular in NLP tasks. This is because a word embedding representation of a word captures more information about the word than just a one-hot representation, since the former captures the semantic similarity of that word to other words, whereas the latter representation is equidistant from all other words. FastText is another way to implement word embeddings (recently open-sourced by Facebook researchers).
Tf-idf, instead, is a scoring scheme for words, that is, a measure of how important a word is to a document.
From a practical usage standpoint, while tf-idf is a simple scoring scheme and that is its key advantage, word embeddings may be a better choice for most tasks where tf-idf is used, particularly when the task can benefit from the semantic similarity captured by word embeddings (e.g. in information retrieval tasks).
Unlike Word2Vec, which learns a vector representation of the entire word, FastText learns a representation for each n-gram of the word, as you have already seen. So the overall word embedding is the sum of the n-gram representations. Basically, because a FastText model works with more units (the number of n-grams is greater than the number of words), it performs better than Word2Vec and allows rare words to be represented appropriately.
From my standpoint, in general it does not make sense to use FastText (or any word embedding method) together with Tf-Idf. But if you want to use Tf-Idf with FastText, you must sum all the n-grams that compose your word and use this representation to calculate the Tf-Idf.
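To illustrate that last point, here is a toy sketch with Gensim's FastText implementation (gensim 4.x parameter names), where the vector returned for a word is assembled from its character n-grams, so even an out-of-vocabulary word gets a usable representation:

from gensim.models import FastText

ft = FastText(sentences=[["listen", "to", "the", "latest", "news", "summary"]],
              vector_size=50, min_count=1, min_n=3, max_n=6)

# "listened" never appeared in the training data, but its char n-grams did,
# so FastText can still build a vector for it from the n-gram representations
oov_vector = ft.wv["listened"]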
I have a pre-trained word2vec model from gensim, and using gensim to find similarities between words works as expected. But I am having problems finding similarities between two different sentences. Using cosine similarity is not a good option for sentences and is not giving good results. Soft cosine similarity in gensim gives slightly better results, but it still does not look good.
I found WMD similarity in gensim. This is a bit better than soft cosine and cosine.
I am wondering if there are more options, like using deep learning (Keras and TensorFlow) to find sentence similarities from pre-trained word2vec. I know classification can be done using word embeddings, and that seems like a somewhat better option, but then I would need to find training data and label it from scratch.
So, I am wondering if there is any other option that can use pre-trained word2vec in Keras to get sentence similarities. Is there a way? I am open to any suggestions and advice.
Before reinventing the wheel, I'd suggest trying the doc2vec method from gensim; it works quite well and is easy to use.
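A minimal Doc2Vec sketch, assuming tokenized_sentences is a placeholder for your corpus as lists of tokens (gensim 4.x names):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=tokens, tags=[i])
        for i, tokens in enumerate(tokenized_sentences)]
model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=20)

# infer vectors for two unseen sentences and compare them with cosine similarity
v1 = model.infer_vector("the cat sat on the mat".split())
v2 = model.infer_vector("a cat was sitting on a rug".split())
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))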
To implement it in Keras, reusing the embeddings you have computed with gensim:
Store the word embeddings in a file, one word per line with the corresponding embedding. Alternatively, you can do as #Paul suggested and skip step 2 and reuse the layer in step 3.
Load the word embeddings into a Keras Embedding layer (see the sketch after these steps). You can check out this Keras tutorial for more details (check how the embedding_layer variable is initialized).
Then a sequence-to-sequence model can be used to compute the embedding of the text, in which you have an encoder that embeds the string and a decoder that converts the embedding back to a string. Here is a Keras tutorial that translates from English to French. You can use a similar process to map your text onto itself and pick the internal embedding for your similarity metric.
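A sketch of step 2, assuming word_index is a placeholder mapping your words to integer ids and w2v is the gensim model holding the pre-computed embeddings:

import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

dim = w2v.vector_size
embedding_matrix = np.zeros((len(word_index) + 1, dim))
for word, i in word_index.items():
    if word in w2v.wv:
        embedding_matrix[i] = w2v.wv[word]

embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                            output_dim=dim,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)  # keep the pre-trained weights frozen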
You can also have a look at how the paragraph-to-vector model works; you can implement it in Keras as well, loading the word embedding weights that you have computed.
I am trying to calculate the similarity between two documents, each of which is comprised of more than a thousand sentences.
The baseline would be calculating cosine similarity using BOW.
However, I want to capture more of the semantic difference between documents.
Hence, I built word embeddings and calculated document similarity by generating document vectors (simply averaging all the word vectors in each document) and measuring cosine similarity between these document vectors.
However, since the size of each input document is rather big, the results I get from using the method above are very similar to simple BOW cosine similarity.
I have two questions,
Q1. I found that the gensim module offers soft cosine similarity. But I am having a hard time understanding how it differs from the methods I used above, and I think it may not be the right mechanism to calculate similarity between millions of pairs of documents.
Q2. I found that Doc2Vec from gensim would be more appropriate for my purpose. But I realized that training Doc2Vec requires more RAM than I have (32 GB); the size of my entire set of documents is about 100 GB. Would there be any way to train the model with a small part (say 20 GB) of the entire corpus, and use this model to calculate pairwise similarities over the entire corpus?
If so, what would be the desirable training set size, and is there any tutorial I can follow?
Ad Q1: If the similarity matrix contains the cosine similarities of the word embeddings (which it more or less does, see Equation 4 in SimBow at SemEval-2017 Task 3) and if the word embeddings are L2-normalized, then the SCM (Soft Cosine Measure) is equivalent to averaging the word embeddings (i.e. your baseline). For a proof, see Lemma 3.3 in the Implementation Notes for the SCM. My Gensim implementation of the SCM (1, 2) additionally sparsifies the similarity matrix to keep the memory footprint small and to regularize the embeddings, so you will get slightly different results compared to vanilla SCM. If embedding averaging gives you similar results to simple BOW cosine similarity, I would question the quality of the embeddings.
Ad Q2: Training a Doc2Vec model on the entire dataset for one epoch is equivalent to training a Doc2Vec model on smaller segments of the entire dataset, one epoch for each segment. Just be aware that Doc2Vec uses document ids as a part of the training process, so you must ensure that the ids are still unique after the segmentation (i.e. the first document of the first segment must have a different id than the first document of the second segment).
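A sketch of how the ids can be kept unique while streaming the segments, assuming each segment is a text file with one whitespace-tokenizable document per line (the file layout is an assumption):

from gensim.models.doc2vec import TaggedDocument

class SegmentedCorpus:
    """Streams TaggedDocuments from several corpus segments with globally unique tags."""
    def __init__(self, segment_paths):
        self.segment_paths = segment_paths

    def __iter__(self):
        doc_id = 0  # global counter so ids never repeat across segments
        for path in self.segment_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield TaggedDocument(words=line.split(), tags=[doc_id])
                    doc_id += 1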
I know that word2vec in gensim can compute similarity between words. But now I want to compute word similarity using TF-IDF or LSA with gensim. How to do it?
note: Computing document similarity using LSA with gensim is easy: http://radimrehurek.com/gensim/wiki.html
TF-IDF is a weighting scheme so it's not an alternative to LSA.
Imagine your problem as a matrix of "m" terms by "n" documents. Each entry Aij of your matrix represents the weight of term "i" in document "j". This is where you use TF-IDF: it tells you what to put in each cell of the matrix.
Then, if it suits your application, you can reduce the dimensions of the matrix using LSA.
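A sketch of that matrix view with scikit-learn (gensim's TfidfModel and LsiModel follow the same idea); the toy documents below are placeholders:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "dogs and cats are common pets",
        "stock markets fell sharply today"]

tfidf = TfidfVectorizer()
term_doc = tfidf.fit_transform(docs).T        # transpose: rows = terms, columns = documents

svd = TruncatedSVD(n_components=2)            # LSA step: one dense vector per term
term_vectors = svd.fit_transform(term_doc)

terms = list(tfidf.get_feature_names_out())
v_cat = term_vectors[terms.index("cat")]
v_dogs = term_vectors[terms.index("dogs")]
print(np.dot(v_cat, v_dogs) / (np.linalg.norm(v_cat) * np.linalg.norm(v_dogs) + 1e-12))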
I hope this clears up the issue a little.