I've been experimenting with NLP, using the Doc2Vec model.
My objective is a suggested-question feature for a forum. For example, if a user types a question, its vector is compared to the vectors of questions already asked. So far this has worked OK in the sense of comparing one question to another already-asked question.
However, I would like to extend this to comparing the body of the question. For example, just like on Stack Overflow, I'm writing a description for my question.
I understand that Doc2Vec represents sentences through paragraph IDs. So for the question example I described first, each sentence will get a unique paragraph ID. However, with paragraphs, i.e. the body of the question, sentences will share an ID with the other sentences that are part of the same paragraph:
para = 'This is a sentence. This is another sentence'
[['This','is','a','sentence',tag=[1]], ['This','is','another','sentence',tag=[1]]]
I'm wondering how to go about doing this. How can I input a corpus like so:
['It is a nice day today. I wish I was outside in the sun. But I need to work.']
and compare that to another paragraph like this:
['It is a lovely day today. The sun is shining outside today. However, I am working.']
I would expect a very close similarity between the two. Does similarity get calculated sentence to sentence, rather than paragraph to paragraph? i.e.
cosine_sim(['It is a nice day today'], ['It is a lovely day today'])
and do this for the other sentences and average out the similarity scores?
Thanks.
EDIT
What I am confused about is the following: using the above sentences, say the vectors are like so
sent1 = [0.23,0.1,0.33...n]
sent2 = [0.78,0.2,-0.6...n]
sent3 = [0.55,-0.5,0.9...n]
# Average out these vectors
para = [0.5,0.2,0.3...n]
and then compare this paragraph vector to one built for another paragraph by the same process.
I'll presume you're talking about the Doc2Vec model in the Python Gensim library - based on the word2vec-like 'Paragraph Vector' algorithm. (There are many alternate ways to turn a text into a vector, and some of them, including the very simple approach of averaging word-vectors together, also get called 'Doc2Vec'.)
Doc2Vec has no internal idea of sentences or paragraphs. It just considers texts: lists of word tokens. So you decide what-sized chunks of text to provide, and to associate with tag keys: multiword fragments, sentences, paragraphs, sections, chapters, articles, books, whatever.
Every tag you provide during initial bulk training will have an associated vector trained up, and stored in the model, based on the lists-of-words that you provided alongside it. So, you can retrieve those vectors from training via:
d2v_model.dv[tag]
You can also use that trained, frozen model to infer new vectors for new lists-of-words:
d2v_model.infer_vector(list_of_words)
(Note: these words should be preprocessed/tokenized the same way as those during training, and any words not known to the model from training will be silently ignored.)
And, once you have vectors for two different texts, from whatever method, you can compare them via cosine-similarity.
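For example, here is a minimal end-to-end sketch with the Gensim 4.x API. The tiny corpus, integer tags, and parameters are just placeholders for illustration; real training needs far more data than this.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
import numpy as np

# toy corpus for illustration only; each text gets a unique integer tag
raw_docs = [
    "It is a nice day today. I wish I was outside in the sun.",
    "It is a lovely day today. The sun is shining outside today.",
]
train_corpus = [TaggedDocument(words=simple_preprocess(text), tags=[i])
                for i, text in enumerate(raw_docs)]

model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

trained_vec = model.dv[0]  # vector learned during bulk training, looked up by tag
new_vec = model.infer_vector(simple_preprocess("However, I am working today."))

# cosine similarity between any two vectors
sim = np.dot(trained_vec, new_vec) / (np.linalg.norm(trained_vec) * np.linalg.norm(new_vec))
print(sim)

The same infer_vector() call works whether your texts are single sentences, whole bodies, or question-plus-body concatenations; the model only ever sees lists of tokens.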
For creating your doc-vectors, you might want to run the question & body together into one text. (If the question is more important, you could even consider training on a pseudotext that repeats the question more than once, for example both before and after the body.) Or you might want to treat them separately, so that some downstream process can weight question->question similarities differently than body->body or question->body. What's best for your data & goals usually has to be determined via experimentation.
Related
I need to train a model in Python based on word2vec or other models in order to get adjectives which are semantically close to a word.
For example give a word like 'cat' to model and receive adjectives like 'cute', 'nice', etc.
Is there any way?
With any word2vec model – whether you train it on your own data, or download someone else's pre-trained model – you can give it a word like cat and receive back a ranked list of words that are considered 'similar' in its coordinate system.
However, these won't normally be limited to adjectives, as typical word2vec models don't take any note of a word's part-of-speech. So to filter to just adjectives, some options could include:
use a typical word2vec set-of-vectors that is oblivious to part-of-speech, but use some external reference (like, say, WordNet) to check each returned word, and discard those that can't be adjectives (a sketch of this approach follows after the list)
preprocess a suitable training corpus to label words with their part-of-speech before word2vec training, as is sometimes done. Then your model's tokens will include within them a declared part-of-speech. For example, you'd then no longer have the word good alone as a token, but (depending on what conventions you use) tagged-tokens like good/NOUN & good/ADJ instead. Then, filtering the closest-words to just adjectives is a simple matter of checking for the desired string pattern.
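For instance, here is a rough sketch of the first option, assuming a pretrained vector set from gensim-downloader and NLTK's WordNet; both are just illustrative choices, and NLTK's wordnet corpus must be downloaded beforehand.

import gensim.downloader as api
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') beforehand

wv = api.load('glove-wiki-gigaword-100')  # any word2vec-compatible vectors would do

def similar_adjectives(word, topn=50, keep=10):
    """Keep only those nearest neighbours that WordNet lists as possible adjectives."""
    adjectives = []
    for neighbour, _score in wv.most_similar(word, topn=topn):
        if any(s.pos() in ('a', 's') for s in wn.synsets(neighbour)):  # 'a'/'s' = adjective synsets
            adjectives.append(neighbour)
        if len(adjectives) >= keep:
            break
    return adjectives

print(similar_adjectives('cat'))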
However, the words you receive from any such process based on word2vec might not be precisely what you're looking for. The kinds of 'semantic similarity' captured by word2vec coordinates are driven by how well words predict other nearby words under the model's limitations. Whether these will meet your needs is something you'll have to try; there could be surprises.
For example, words that humans consider antonyms, like hot & cold, will still be relatively close to each other in word2vec models, as they both describe the same aspect of something (its temperature), and often appear in the same surrounding-word contexts.
And, depending on training texts & model training parameters, different word2vec models can sometimes emphasize different kinds of similarity in their rankings. Some have suggested, for example, that using a smaller window can tend to place words that are direct replacements for each other (same syntactic roles) closer together, whereas a larger window will somewhat more bring together words used in the same topical domains (even if they aren't of the same type). Which kind of similarity would be better for your need? I'm not sure; if you have the time/resources, you could compare the quality of results from multiple contrasting models.
I am working on a problem where I have text data with around 10,000 documents. I have created an app where, if a user enters some random comment, it should display all the similar comments/documents present in the training data.
Exactly like on Stack Overflow: if you ask a question, it shows all related questions asked earlier.
So if anyone has any suggestions on how to do it, please answer.
Second, I am trying the LDA (Latent Dirichlet Allocation) algorithm, where I can get the topic to which my new document belongs, but how will I get the similar documents from the training data? Also, how should I choose num_topics in LDA?
If anyone has suggestions for algorithms other than LDA, please tell me.
You can try the following:
Doc2vec - this is an extension of the extremely popular word2vec algorithm, which maps words to an N-dimensional vector space such that words that occur in similar contexts in your documents will occur in close proximity in the vector space. You can use pre-trained word embeddings. Learn more about word2vec here. Doc2vec is an extension of word2vec that lets you map each document to a vector of dimension N. After this, you can use any distance measure to find the most similar documents to an input document.
Word Mover's Distance - This is directly suited to your purpose and also uses word embeddings. I have used it in one of my personal projects and achieved really good results. Find more about it here
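A minimal sketch of Word Mover's Distance with Gensim's KeyedVectors follows; the pretrained vector set and the example documents are only placeholders, and the call also needs Gensim's optional WMD dependency (pyemd or POT, depending on your version).

import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-100')  # any pretrained word vectors would do

doc_a = 'how do I sort a list in python'.lower().split()
doc_b = 'sorting a python list the easy way'.lower().split()

# lower distance means more similar documents
print(wv.wmdistance(doc_a, doc_b))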
Also, make sure you apply appropriate text cleaning before applying the algorithms. Steps like case normalization, stopword removal, punctuation removal, etc. It really depends on your dataset. Find out more here
I hope this was helpful...
I have a text file with millions of rows that I want to convert into word vectors, so that later I can compare these vectors with a search keyword and see which texts are closest to the search keyword.
My dilemma is that all the training files I have seen for word2vec are in the form of paragraphs, so each word has some contextual meaning within the file. My file, however, is different: it contains independent keywords, one per row.
My question is whether it is possible to create word embeddings using this text file; if not, what's the best approach for finding a matching search keyword among these millions of texts?
**My file structure:**
Walmart
Home Depot
Home Depot
Sears
Walmart
Sams Club
GreenMile
Walgreen
Expected output:
Search text: 'WAL'
Result from my file:
WALGREEN
WALMART
WALMART
Embeddings
Let's step back and understand what word2vec is. Word2vec (like GloVe, FastText, etc.) is a way to represent words as vectors. ML models don't understand words, they only understand numbers, so when we are dealing with words we want to convert them into numbers (vectors). One-hot encoding is one naive way of encoding words as vectors, but for a large vocabulary the one-hot vectors become too long, and there is no semantic relationship between one-hot encoded words.
With DL came the distributed representation of words (called word embeddings). One important property of these word embeddings is that the vector distance between related words is small compared to the distance between unrelated words, i.e. distance(apple, orange) < distance(apple, cat).
So how are these embedding models trained? They are trained on (very) huge corpora of text. With a huge corpus the model will learn that apple and orange are used (many times) in the same context, and so it will learn that apple and orange are related. So to train a good embedding model you need a huge corpus of text (not independent words, because independent words have no context).
However, one rarely trains a word embedding model from scratch, because good embedding models are available as open source. If your text is domain-specific (say, medical), you can instead do transfer learning on openly available word embeddings.
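As a quick illustration of that distance property, using an arbitrary pretrained model from gensim-downloader (the specific model name here is just an example):

import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-50')

print(wv.similarity('apple', 'orange'))  # expected to be comparatively high (related words)
print(wv.similarity('apple', 'cat'))     # expected to be noticeably lower (unrelated words)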
Out of vocabulary (OOV) words
Word embeddings like word2vec and GloVe cannot return an embedding for OOV words. However, embeddings like FastText (thanks to #gojom for pointing it out) handle OOV words by breaking them into character n-grams and building a vector by summing up the subword vectors that make up the word.
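A small sketch of that behaviour with Gensim's FastText implementation; the toy corpus and parameters are only for illustration.

from gensim.models import FastText

sentences = [['walmart', 'store'], ['walgreen', 'pharmacy'], ['home', 'depot', 'store']]
ft = FastText(vector_size=32, min_count=1, min_n=3, max_n=5)
ft.build_vocab(sentences)
ft.train(sentences, total_examples=len(sentences), epochs=10)

# 'wal' was never seen as a whole token, but FastText assembles a vector from its char n-grams
print(ft.wv['wal'])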
Problem
Coming to your problem,
Case 1: Let's say the user enters the word WAL. First of all, it is not a valid English word, so it will not be in the vocabulary, and it is hard to assign a meaningful vector to it. Embeddings like FastText handle this by breaking the word into n-grams. This approach gives good embeddings for misspelled words or slang.
Case 2: Let's say the user enters the word WALL. If you plan to use vector similarity to find the closest word, it will never be close to Walmart, because semantically they are not related; it will rather be close to words like window, paint and door.
Conclusion
If your search is for semantically similar words, then a solution using vector embeddings will work well. On the other hand, if your search is based on lexical matching, then vector embeddings will be of no help.
If you wanted to find walmart from a fragment like wal, you'd more likely use something like:
a substring or prefix search through all entries (a quick sketch follows this list); or
a reverse-index-of-character-n-grams; or
some sort of edit-distance calculated against all entries or a subset of likely candidates
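For example, a tiny sketch of the first and third options over the entries from the question; difflib is just one convenient stand-in for an edit-distance style match.

import difflib

entries = ['Walmart', 'Home Depot', 'Home Depot', 'Sears', 'Walmart',
           'Sams Club', 'GreenMile', 'Walgreen']

def prefix_search(fragment, entries):
    """Case-insensitive prefix match over all entries."""
    fragment = fragment.lower()
    return [e for e in entries if e.lower().startswith(fragment)]

print(prefix_search('wal', entries))  # ['Walmart', 'Walmart', 'Walgreen']

# fuzzy, edit-distance-like matching against a misspelled query
print(difflib.get_close_matches('walmrat', [e.lower() for e in entries]))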
That is, from your example desired output, this is not really a job for word-vectors, even though some algorithms, like FastText, will be able to provide rough vectors for word-fragments based on their overlap with trained words.
If in fact you want to find similar stores, word-vectors might theoretically be useful. But the problem given your example input is that such word-vector algorithms require examples of tokens used in context, from sequences-of-tokens that co-appear in natural-language-like relationships. And you want lots of data featuring varied examples-in-context, to capture subtle gradations of mutual relationships.
While your existing single-column of short entity-names (stores) can't provide that, maybe you have something applicable elsewhere, if you have richer data sources. Some ideas might be:
lists of stores visited by a single customer
lists of stores carrying the same product/UPC
text from a much larger corpus (such as web-crawled text, or maybe Wikipedia) in which there are sufficient in-context usages of each store-name. (You'd just throw out all the other words created from such training - but the vectors for your tokens-of-interest might still be of use in your domain.)
I have a corpus that contains several documents, for example 10 documents. The idea is to compute the similarity between them and combine the most similar ones into one document, so the result may be 4 documents. What I have done so far is iterate over the documents, find the two most similar documents and combine them into one document, and so on until a threshold is reached. I use Word2vec vectors, taking the mean vector over the whole document. The problem is that as I proceed with the iterations, the longer a document gets the more similar it appears, even when it is not really that similar, simply because it contains more words. Any idea how to approach this problem?
I used Google's pre-trained Word2vec model. The reason: the corpus is not big enough to train a model.
Note: I do not want to use topic modeling, for specific reasons. Also, the documents are really short; more than half of them may be a single sentence.
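For reference, this is roughly what my current approach looks like (a minimal sketch; the model path, tokenisation and example texts are simplified placeholders):

import numpy as np
from gensim.models import KeyedVectors

# path to the pretrained Google News vectors mentioned above (placeholder)
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def doc_vector(doc):
    """Mean of the word vectors of all in-vocabulary tokens in a document."""
    vectors = [wv[w] for w in doc.lower().split() if w in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = 'the match was cancelled because of rain'
doc2 = 'rain stopped play in the second innings'
print(cosine(doc_vector(doc1), doc_vector(doc2)))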
I really appreciate your suggestions.
I am working on a language that is not English, and I have scraped the data from different sources. I have done my preprocessing: punctuation removal, stop-word removal and tokenization. Now I want to extract domain-specific lexicons. Let's say I have data related to sports, entertainment, etc., and I want to extract words related to these particular fields, like cricket, and place them in closely related topics. I tried to use LDA for this, but I am not getting the correct clusters. Also, a word that is part of one topic also appears in other topics.
How can I improve my results?
# URDU STOP WORDS REMOVAL
from gensim import corpora, models

doc_clean = []
stopwords_corpus = UrduCorpusReader('./data', ['stopwords-ur.txt'])
stopwords = stopwords_corpus.words()
# print(stopwords)

# `wordlists` is the corpus reader over the documents (defined elsewhere)
for infile in wordlists.fileids():
    words = wordlists.words(infile)
    # print(words)
    finalized_words = remove_urdu_stopwords(stopwords, words)
    doc_clean.append(finalized_words)  # append returns None, so don't assign its result
    print("\n==== WITHOUT STOPWORDS ===========\n")
    print(finalized_words)

# making dictionary and corpus
dictionary = corpora.Dictionary(doc_clean)

# convert tokenized documents into a document-term matrix
matrx = [dictionary.doc2bow(text) for text in doc_clean]

# generate LDA model
lda = models.ldamodel.LdaModel(corpus=matrx, id2word=dictionary, num_topics=5, passes=10)

for top in lda.print_topics():
    print("\n===topics from files===\n")
    print(top)
LDA and its drawbacks: The idea of LDA is to uncover latent topics from your corpus. A drawback of this unsupervised machine-learning approach is that you will end up with topics that may be hard for humans to interpret. Another drawback is that you will most likely end up with some generic topics including words that appear in every document (like 'introduction', 'date', 'author', etc.). Thirdly, you will not be able to uncover latent topics that are simply not present enough: if you have only one article about cricket, it will not be recognised by the algorithm.
Why LDA doesn't fit your case:
You are searching for explicit topics like cricket and you want to learn something about cricket vocabulary, correct? However, LDA will output some topics, and you need to recognise cricket vocabulary in order to determine that, e.g., topic 5 is concerned with cricket. Oftentimes LDA will identify topics that are mixed with other, related topics. Keeping this in mind, there are three scenarios:
You don't know anything about cricket, but you are able to identify the topic that's concerned with cricket.
You are a cricket expert and already know the cricket vocabulary
You don't know anything about cricket and are not able to identify the semantic topic that the LDA produced.
In the first case, the problem is that you are likely to associate words with cricket that are actually not related to cricket, because you are counting on the LDA output to provide high-quality topics that are only concerned with cricket and not with other related topics or generic terms. In the second case, you don't need the analysis in the first place, because you already know the cricket vocabulary! The third case is likely when you are relying on your computer to interpret the topics. However, in LDA you always rely on humans to give a semantic interpretation of the output.
So what to do: There's a paper called Targeted Topic Modeling for Focused Analysis (Wang 2016) which tries to identify which documents are concerned with a pre-defined topic (like cricket). If you have a list of topics for which you'd like to get some topic-specific vocabulary (cricket, basketball, romantic comedies, ...), a starting point could be to first identify the relevant documents and then analyse the word distributions of the documents related to a certain topic.
Note that perhaps there are completely different methods that will perform exactly what you're looking for. If you want to stay in the LDA-related literature, I'm relatively confident that the article I linked is your best shot.
Edit:
If this answer is useful to you, you may find my paper interesting, too. It takes a labeled dataset of academic economics papers (600+ possible labels) and tries various LDA flavours to get the best predictions on new academic papers. The repo contains my code, documentation, and also the paper itself.